EDUCATIONAL 
MEASUREMENT 

Where Are We Going and 

How Will We Know 

When We Get There? 



John W. Wick 



ALBERT R. MANN 

LIBRARY 

AT 

CORNELL UNIVERSITY 



Cornell University Library 
LB3051.W63 







Cornell University 
Library 



The original of this book is in 
the Cornell University Library. 

There are no known copyright restrictions in 
the United States on the use of the text. 



http://www.archive.org/details/cu31924013096023 



Educational Measurement 



Where Are We Going and 
How Will We Know When We Get There? 



John W. Wick 

Northwestern University 



CHARLES E. MERRILL PUBLISHING COMPANY 
A Bell & Howell Company 
Columbus, Ohio 



Published by 

Charles E. Merrill Publishing Company 
A Bell & Howell Company 
Columbus, Ohio 43216 



Copyright © 1973 by Bell & Howell Company. All rights reserved. No 
part of this book may be reproduced in any form, electronic or mechanical, 
including photocopy, recording, or any information storage and 
retrieval system, without permission in writing from the publisher. 

ISBN: 0-675-08947-6 

Library of Congress Catalog Card Number: 72-97836 

1 2 3 4 5 6 / 78 77 76 75 74 73 






Printed in the United States of America 



Contents 

Preface vii 

1 "... and in Arithmetic, I want you to focus a little 
more on problem solving this year" 1 

2 Thinking Positively About Behavioral Objectives and 
Learning to Write Them 17 

In Support of Behavioral Objectives, 43; Summary, 44 

3 Behavioral Objectives: The Other Side of the Coin 46 

Fairness, 46; Practicality, 52; Cautionary Notes, 54 

4 Classroom Tests: A Survey of the Terrain 61 

Comprehensiveness and Acceptability of the Objectives, 
62; The Comprehensiveness of the Measure, 70; Test Ad- 
ministration Decisions: Who Is to Be Evaluated? When? 
76; In-level Testing, 80; What Are Your Expectations for 
the Students? 83; Reporting the Results, 87; Now, How 
Do All These Things Fit Together? 88; Evaluation in the 
Classroom, 88 

5 Modes of Transportation: Questioning Formats in the 
Classroom 92 

Performance Measures, 93; Unobtrusive Measures, 96; 
General Considerations in Achievement Testing, 99; Sup- 
ply-Type Questions, 102; Non-Supply-Type Questions, 
108; Summary, 125 

6 Classroom Applications 128 

7 How Will We Know When We Get There? Standard- 
ized Measures 141 

The World of Standardized Tests: A Summary Outline, 
148; Achievement vs. Aptitude: An Operational Distinc- 
tion, 152; Summary, 178 

8 Reliability or "Could We Find Our Way Again?" 180 

Techniques for Computing Reliability, 180; Comparing 
the Techniques, 191; Cautionary Notes on Interpreting 
Reliability Coefficients, 192; A Final Note on Correcting 
Limited Reliability Coefficients, 194; Summary, 197 

9 Validity: Are We At the Right Place? 199 

Content Validity, 201; Criterion-Related Validity, 204; 
Construct Validity, 210; Interpreting Validity Studies: 
The Expectancy Table, 213; Hints, Warnings, and Sug- 
gestions: Factors to Consider in Measuring and Inter- 
preting Validity, 226; Summary, 229 

10 Score Reporting: "Now That We're Here, Let's Write 

a Little Letter to the Folks Back Home" 231 

Reporting Raw Scores, 232; Percentile Ranks and Per- 
centile Norms, 234; Standard Scores and Standard Score 
Norms, 242; Grade Equivalent Norms, 249; Marking: 
Reporting the Results, 256 

11 "Acquired Behavioral Dispositions" The Measurement 

of Attitudes and Interests 260 

The Relationship between a Student's Attitude and His 
Behavior, 261; Attitude Change Programs Linked to De- 
sired Behavior Change, 263; Likert-Type Items, 267; The 
Method of Equal-Appearing Intervals, 275; Occupational 
Choice as a Measure of Interest, 277; The Need for a Base 
Line with Attitude Measures, 278; The Sociogram: Social 
Relationships Among Students, 279; Summary, 284 

12 On Evaluating a Project: Some Practical Suggestions 286 

Short Term/Long Term: The Impact of the Project, 287; 
Concerning Car Salesmen: What, Specifically, Did the 
Project Writers Promise to Do? 289; Evaluating Objec- 
tives: Two Interpretations, 290; The Budget: Sections 
Which Cost More Should Produce More, 291; The Project 
Staff: Concerning Prior Commitments and Real Power, 
292; Timetable: If a Child Is to Be Born in October, 
Conception Must Occur Somewhat Earlier, 294; Individu- 
alized Instruction Requires Individualized Reporting, 
295; Continuous Assessment: A General Evaluation 
Model, 296; Some Other Roles of an Evaluator: Keeping 
Communication Lines Open Among All Groups in the 
Project, 301; Training Internal Project Evaluators, 303 

Appendix 305 

Index 317 



Preface 



This is a book about educational measurement, but that obviously 
includes educational testing as well. I had some fairly specific 
guidelines in mind during the time of composition. See if some of 
these are what you have in mind when you consider the topic: 

Most education students learn about educational measurement 
not because it will be a life work, but rather, because these con- 
cepts are useful (necessary even!) in much educational decision 
making. 

Thus, the book should be directed toward this group of pre-service 
teachers, counselors, administrators, and not be narrowly 
conceived as a first course for future specialists in the field of educational 
measurement. A book directed at the practitioners can be 
used for future specialists; but a book directed toward the specialists 
is probably more appropriate for the generalist. A survey of the 
contents will show which topics have received more coverage than 
one usually finds in a beginning measurement text, as well as some 
topics which have much reduced coverage herein. Both changes in 
emphasis were made in light of the goal of making the book as 
practical and as useful as possible. 

The book has a heavy emphasis on setting objectives and you 
will find a thorough discussion of the reverse side of the objectives-setting 
movement. It is long on practical hints and cautionary 
notes to users, and short on history, theory, and difficult computations. 
The discussion of standardized tests focuses on uses and 
misuses — not on selection. 

Educational measurement is serious business — but not so serious 
we can't have a little fun learning it. I thought it might be inter- 
esting to try to bring a smile to the reader's face now and then. 
Even serious topics can be treated with some humor. 

The book begins with a farcical play that deals with a very se- 
rious philosophical point. An entire semester could be spent dis- 
cussing topics such as, "Should objectives be set?" and if so, "Who 
should establish the objectives?" These really lead to the fairness 
questions: "Is it fair to the teacher to have objectives imposed?" 
and "Is it fair to the teacher not to impose objectives?" These 
questions can be directed toward the student as well. The chapter 
is designed to open dialogue on these issues. 

From here, the chapter on setting objectives logically follows, 
and a survey of criticisms of the behavioral objectives concept 
naturally follows that. These three chapters set the foundation for 
those which follow. Hopefully, these introductory chapters are writ- 
ten in a light enough vein so that the reader will go to the last nine 
chapters in a good frame of mind. 

The remaining chapters include standard topics such as reliabil- 
ity, validity, standard scores, standardized measures, and score 
reporting. But some rather unique chapters are also included: 
Chapter 4, for example, contains a systematic and practical guide 
to the preparation of classroom measures of all kinds. A single 
philosophy has not been assumed so that the guide will work for 
individual or group instruction, and for instructional programs with 
and without behaviorally stated objectives. Some rather searching 
questions are addressed — questions which should make the student 
consider facets of his or her own philosophy of education. 

The chapter on attitude measurement (acquired behavioral dis- 
positions) is quite extensive. With only this chapter as background, 
the teacher or administrator should be able to make a good first at- 
tempt at developing an instrument to assess attitude. 

The book ends with a chapter filled with hints on project or 
program evaluation. Although this topic may not be of as much practical 
interest to the teacher as the other eleven chapters are, it has 
been included for other reasons. Nearly everyone in the profession 
eventually gets involved with a project of some kind — a research 
effort, a program development, or a plan for curriculum change. I 
felt that some background information on project evaluation would 
be useful and might tend to help the person focus more sharply on 
the program at hand. 

I want to thank Professor Darrell Sabers of the University of 
Arizona and Professor Larry Braskamp of the University of Nebraska 
for their reviews and comments on the original manuscript. 
In addition, the ideas of two of my colleagues, Professors James 
Hall and Robert Coughlan, are surely embedded throughout the 
manuscript due to long conversations about certain of the issues. A 
note of special appreciation goes to Ms. Carolyn Dirkes, who cor- 
rected my spelling and bad grammar, checked out many of the 
issues in the library, contributed most of the exercises dealing with 
measurement in literature classes, and typed the manuscript. 



To that particular student who kept saying "What does that 
mean?"— "Why should I learn this?"— Things like that which 
finally rearranged my thinking and made this book possible. 



". . . and in Arithmetic, 

I want you to focus 

a little more on 

problem solving this year" 



You are going to take a trip in this book. 

No, not the drug-type trip. This trip is through the area of 
educational measurement. Actually, any kind of learning requires 
answers to these three key questions: 

Where are we going? 

How do we get there? 

How will we know when we get there? 

Anyone applying principles of educational measurement must 
address all three questions, of course, since each answer depends 
on the other two answers. 

The first question is addressed to the statement of the objectives 
of the learning program. The second is the instructional strategy 
question: How shall the objectives be approached with the stu- 
dents at hand? The last, of course, is the testing, measurement, 
and evaluation part. The scope of this book includes questions one 
and three. These two questions (Where are we going? How will we 
know when we get there?) are inseparable and must be considered together. 
The large middle topic of selecting the instructional strategy 
most appropriate for each given situation will be set aside for 
another time. 

The topic of the first three chapters is the specification of objectives. 
Behavioral objectives imply unambiguous interpretations of 
goals. What can be the ramifications of failing to translate general 
statements of objectives into specific and unambiguous language? 
To help you begin thinking about the answer to this question, the 
following little three-act farce has been developed. The players 
include a superintendent, three of his more diverse teachers, a 
swinging secretary, and some substitutes. When Superintendent 
Twinky makes a general pronouncement regarding mathematical 
problem-solving skills, the three teachers respond in very different 
manners. 

The first chapter will conclude, following the farce, with some 
suggestions to Superintendent Twinky on the topic of translating 
the general objective to more specific language. In the second 
chapter, abandon your pose as information sponge, and become an 
active participant in learning the technique of specifying behavioral 
objectives. The goal for chapter 2 is quite specific: In any situation 
where a general objective can be translated to behavioral language, 
it is our intention that you should be able to do so. The assumption 
is, of course, that you will actively participate in the self-instruc- 
tional program of chapter 2. 

Those who loudly call for more widespread emphasis on the use 
of behavioral objectives are not without detractors. This is only 
right and proper. Situations exist where an overemphasis on be- 
havioral objectives would lead us away from, rather than toward, 
good educational practice. Some possible misuses of behavioral 
objectives frequently mentioned by these detractors are surveyed 
in chapter 3. 

With that short introduction, on with the play. 

A Tweek from Twinky 
(Prologue) 

(A summary of this year's standardized testing results was in 
the morning's mail. Dr. Twinky T, a veteran superintendent 
with eleven years of combat experience in this K-8 school 
district of some three thousand students, is poring over the 
results. His assistant superintendent, Mrs. Clutz C, also has 
a copy of the report from the Eastern Testing Society. She 
runs her finger along the column of numbers and occasionally 
makes a mark with a very sharp no. 4 pencil.) 

Mrs. C: Who is this fellow Kuder? Why do they use his formula 
20? What is a standard error? Are there unstandard errors 
and unstandard deviations? 




Dr. T: Kuder invented reliability — you know, how reliable the 
test is. A standard deviation tells us how far a student 
deviates from a standard, and a standard error means the 
student makes standard types of errors — common things, 
you know. Some students make unstandard errors but 
these are for the standard ones — errors, not students. 

(All of this last statement is utter gibberish. Remember that 
you already know that Dr. Twinky is a veteran superintendent 
with combat experience.) 
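
For readers who want the usual meanings of the terms Dr. Twinky has mangled, a minimal Python sketch follows. It is an illustration under stated assumptions, not a procedure taken from the play: the five students and four right/wrong items are invented, the function names are hypothetical, and the variance of the total scores is computed by dividing by n rather than n - 1. Kuder-Richardson formula 20 estimates the reliability of a test from item-level results; a standard deviation describes how widely scores spread around their mean; and the standard error of measurement estimates how much an observed score would be expected to vary because the test is less than perfectly reliable.

```python
import math
import statistics


def kr20(responses):
    """Kuder-Richardson formula 20 for items scored 1 (right) or 0 (wrong).

    responses: one list per student, holding one 0/1 entry per item.
    Assumption: total-score variance uses the population form (divide by n).
    """
    n_students = len(responses)
    k = len(responses[0])  # number of items on the test
    totals = [sum(row) for row in responses]
    total_variance = statistics.pvariance(totals)

    # Sum of p * q over items, where p is the proportion answering correctly.
    pq_sum = 0.0
    for item in range(k):
        p = sum(row[item] for row in responses) / n_students
        pq_sum += p * (1 - p)

    return (k / (k - 1)) * (1 - pq_sum / total_variance)


def standard_error_of_measurement(sd, reliability):
    """Standard deviation of scores times sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


# Invented data: five students (rows) answering four items (columns).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

totals = [sum(row) for row in scores]
sd = statistics.pstdev(totals)  # spread of total scores around their mean
reliability = kr20(scores)
sem = standard_error_of_measurement(sd, reliability)

print(f"standard deviation: {sd:.2f}")
print(f"KR-20 reliability: {reliability:.2f}")
print(f"standard error of measurement: {sem:.2f}")
```

Nothing in the sketch involves a "standard" from which a student deviates, or "standard types" of errors, which is the point of the stage direction above.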

Dr. T: (continuing) Look at how our medians compare with last 
year's scores. That's the important part. (He concen- 
trates, writing down numbers, mumbling and scowling. 
After some time, he makes an announcement.) Mrs. C, we 
have been medianed downward in Problem Solving! 

Mrs. C: In Problem Solving? And our children have so many 
problems! And to be medianed, no less. . . . My! 

Dr. T: No, no ... in mathematical problem solving. Something 
must be done. Superintendent Pinky's district will soon 
out-median us! 

(Superintendent Pinky has an elementary district near Dr. 
Twinky's school. Each year, when the test scores come in, the 
two compare notes over a series of Bloody Marys and lunch at 
a nearby restaurant.) 

Mrs. C: Why don't we simply announce to the teachers that they 
must spend more time on instruction in the area of Math 
Problem Solving because the school median was too low? 

Dr. T: (Pulling himself up to his full height.) Mrs. Clutz, we 
cannot have our teachers gear their instruction to a 
testing program! The other aspects of our overall program 
are equally important, perhaps even more so! 
(Thinking.) I need to be more subtle. One can't simply 
come out and say such a thing! (More thinking. Soon 
a smile lights the face of the veteran superintendent.) 
Yes. That's it! I'll simply include a paragraph or two 
regarding a general objective like "developing good mathe- 
matical problem solvers for our society" in next year's 
in-service days. (These are held each year prior to the 
start of classes.) The teachers will all understand what 
I really mean. That way no one can accuse me of teaching 
for the achievement tests! 

ACT I 

(The scene is the school's gymnasium-auditorium combina- 
tion. Dr. Twinky is at the front, flanked by Mrs. Clutz, three 
district principals, and the President of the School Board. Dr. 
Twinky is speaking, and has been for fifty-one consecutive 
minutes, according to the School Board President's watch. Dr. 
Twinky is utilizing the audio and visual modes simultaneously 
by punctuating his talk with overlays prepared by Mrs. Clutz 
for use on one of the school's many overhead projectors.) 

Dr. T: (Reading from a prepared text without emphasis.) . . . An- 
other area of special concentration this year is that of 
problem solving in mathematics. Knowledge of the num- 
ber facts is not enough. Being able to add and subtract 
is also not enough. Each day we all face problems which 
involve numbers. Our students must be able to handle 
these problems if they are to be effective citizens in 
tomorrow's society. (The last line received some emphasis 
and he looks around for approval. Only scattered ap- 
plause.) Finally, in social studies this year. . . . 

(The President of the School Board checks his watch for the 
thirteenth time as the speech continues. You have been merci- 
fully spared from the remainder of the speech since the few 
sentences above are hereby labeled the Superintendent's General 
Objective Regarding Problem Solving in Mathematics. As 
happens every year, a copy of the highlights of the State of the 
District speech is sent to each faculty member in case someone 
inadvertently missed any of the pearls in the speech.) 

ACT II 

(The King had made his decree. The time moves now to the 
first day of classes. The sheep are gathering nourishment from 
the shepherds. Among the shepherds one finds variation. All 
have heard and read the King's proclamation. All are sincere 
in the desire to feed the flock. But each seeks a different pasture. 
And as any graduate lamb psychologist knows, sheep 
who constantly partake of different kinds of grass eventually 
respond to stimuli in varying manners.) 

Scene 1 

(Miss Harriet H notes the time. Just after 2:00. Another hour 
and twenty minutes and the restless ten-year-olds will go 
home. Or at least back to whatever enclosure it was from 
which they were issued this morning. Did they like her? Did 
they all like her? Didn't little Dennis frown at her? Or was he 
just concentrating hard? When Hester spoke out of turn was 
she just anxious, or was she trying to show displeasure? Miss 
Harriet wanted the children to like her. She was so glad Dr. 
Twinky had specifically asked that they pay special attention 
to problem solving in mathematics this year. Of course problem 
solving meant games. She would provide games, games, games, 
even if it took one fourth of her salary.) 

Miss H: Now in addition to your Arithmetic and You book we 
will have a special math day every Thursday. Thursday 
will be game day. Here are some of the games we can 
play. This game is called FRACTO. See the FRACTO 
board, the dice, the score sheets, and the special FRACTO 
prizes. Next we can play MEASURE-MAD. In this game 
you can win prizes by making special measurements of 
things using strange measuring instruments. There are 
other games too. Every Thursday we will spend all afternoon 
playing with math games. This will help you learn 
to solve math problems. 

Dennis: Gramma bought me MATH-O-RAMA for my birthday. 
My Dad likes to play it with me. Could I bring it some 
Thursday? 

Miss H: (Glowing, since the earlier frown from Dennis must have 
been concentration after all. Dennis must like her, or he 
wouldn't be bringing this special game.) Oh, that would 
be just fine, Dennis. You bring MATH-O-RAMA any 
time. Has anyone else a math game they would like to 
bring? 

(Five other members of the class mention games they will 
bring. Miss H is enthused. Now the children will like her. And 
she is doing just what the Superintendent asked. The students 
will be learning about problem solving in mathematics. Isn't 
that what educational games are for?) 

ACT II 

Scene 2 

(The class of ten-year-olds in Room 44 is approaching a state 
of total anarchy — and Mr. Sincere S is pleased. No stifling of 
initiative or creativity for Mr. S — no sir! He had read all the 
$1.95 paperbacks about "stultifying silence" and the evils of 
authoritarian control. Soon the natural curiosity of these ten-year-olds 
would reappear after years of lying dormant. He 
would have had a swelling of pride at that thought, had it not 
been for the pounding of the headache which had begun some 
hours before. Nonetheless, he would prevail. Mr. S was cur- 
rent. He was sure of his chosen course of action. Above all, 
he was RELEVANT. Suppose we see how a "relevant" and 
very current teacher interprets Dr. Twinky's edict regarding 
math problem-solving skills.) 

Mr. S: Friends, FRIENDS! Fellow seekers of the truth. (One 
Truth Seeker sought to reaffirm some of Galileo's earlier 
statements regarding gravitational pull. His method of 
data collection involved hurling erasers around the room. 
After repeating his call numerous times, the room noise 
level was reduced to a low roar.) Friends, numbers help 
us in the REAL world! 

Truth Seeker 1: Whaddyamean real world? This is the real 
world! 

Mr. S: (Attempting to sound wistful and wise — difficult when 
shouting.) The real world is out there (pointing) with the 
downtrodden and the lost people. We won't do problems 
from this book! (Holding up a copy of Arithmetic and 
You.) 

Truth Seeker 2: You mean we won't have arithmetic this year? 
(Turning and shouting.) Hey! No arithmetic this year! 
(The remainder join in the cry.) 

Mr. S: (Again, after a semblance of control returns.) We won't do 
problems from this book. We'll do problems which are 
relevant: things you want to learn about. Things like 
unevenness of incomes. . . . like pollution counts and . . . . 
(The ten-year-olds don't understand those very well and 
go back to more interesting activities.) 

(During the next ten days, Mr. S learns how to maintain a 
room environment which is much more controlled. He is a very 
sincere young man; the students feel the sincerity and respect 
him for it. When it comes time to work on problem solving in 
arithmetic, he always works from a copy of the daily paper. 
Wherever there are numbers, graphs, trends, or anything at all 
dealing with numerical ideas, he tries his best to put this in the 
language of the ten-year-old. The students like this approach. 
They also like Mr. S.) 

Mr. S: (To no one in particular.) That old Twink is okay. He 
knows the rote stuff won't sell with the new breed of ten- 
year-old. He wants to work on problem solving in the real 
world. Everyone must agree that this is the REAL mean- 
ing of problem solving. 

ACT II 

Scene 3 

(The bulletin board is newly done in a September theme. The 
desks reflect a carefully planned pattern of disarray. A close 
look would see them forming somewhat of a semicircle around 
the teacher's desk, as if the desk were a stage and the students 
were the spectators. The desk belongs to Mrs. Gretchen G.) 

Mrs. G: Boys and girls, look at the first problem on the worksheet 
that Tommy just gave you. Will you read the first prob- 
lem, Louise? 

Louise: "Mary brought 112 cookies to school. There are 28 chil- 
dren in Mary's class. Of these 13 are boys. If the cookies 
are divided evenly among all of the students, how many 
cookies will each student receive?" 

Mrs. G: Fine. Richard, tell me how many different numbers are 
in that problem. 

Richard: (After some delay.) There are three numbers — the 112, 
the 28, and the 13. 

Mrs. G: Right. Now I want everyone to start finding the correct 
answer to the question in the space I have left on the 
worksheet. (The students dutifully begin to work. Mrs. G 
starts to wander around the room, looking over the work 
of each student.) 

(Mrs. G is an experienced teacher. Except for the ten years 
when she "retired" to make sure her own two children got a 
good start in school, she has been teaching each of the thirty 
years since her graduation from a nearby state teachers' col- 
lege. She is an institution around the school. Parents try to 
use their influence to insure that little junior "gets Mrs. G." 
Mrs. G is also well liked by the school administration — the 
building Principal and the Superintendent. One reason for this 
admiration rests on Mrs. G's uncanny ability to know pre- 
cisely what each really means when they use sentences which 
sound like they mean something else. Let's continue to listen 
to Mrs. G's first presentation regarding problem solving in 
arithmetic.) 

Mrs. G: Roy, what answer do you have? (Mrs. G knows very well 
that Roy has an answer of 4, since she just looked at it. 
Mrs. G likes to call on the child who has a correct answer, 
since this provides positive reinforcement.) 

Roy: Four. 

Mrs. G: That's very good. Tell me, Roy, were there any numbers 
in the problem that you did not need to find that answer? 

Roy: (Re-reading the problem.) I didn't use the 13. 

Mrs. G: Right! Sometimes a problem will have more information 
than you really need. The extra information is put in the 
problem to see if you really know what you're doing. Now, 
here is how I want you to do this worksheet. First, do 
the problem. Then circle any number given in the problem 
which was not needed to solve the problem. Just to get 
started, circle the 13 in problem 1. Everyone do that now. 
(Pauses and watches.) Fine. Any questions? All right, 
you can work on the problems for the next fifteen minutes. 

(Mrs. G closes the brand new manila folder labeled "Math 
Problem Solving." In the folder are two forms of three differ- 
ent standardized tests used for ten-year-old students — like the 
ones in her class. The problems from the tests have been carefully 
categorized into types. One type was called "Problems 
with Extra Distracting Numbers." This problem type had 
been the inspiration for today's worksheet. Mrs. G knew that 
standardized testing in the district would be done in February. 
She had made plans to cover each of the different problem 
types once before the winter holidays, and then do intensive 
reviews in January, prior to the testing.) 

Mrs. G: (To herself.) I thought that the problem-solving scores 
would go down last year, what with this new Arithmetic 
and You book. It's just too modern — not enough of the 
good old problems in it. Dr. Twinky doesn't like to see 
those test scores low, and neither do I. Oh, I won't teach 
them the real test — but those test writers know what 
they're doing, and they know what problem solving is. I 
mean, they know better than I do, don't they? I'll make 
sure my students can do all of the different kinds of 
problems found on these tests. That's problem solving — 
doesn't it say so in the title of the test? Everyone would 
agree that this is problem solving. Wouldn't they? 

(The months pass, and it is February 1 of the same school 
year. The three teachers all know that February 4 is a meeting 
day for them and they will not be in class. They — Miss Harriet 
H, Mr. Sincere S, and Mrs. Gretchen G — decide to have a 
mathematics problem-solving test on that day. A substitute 
teacher should at least be able to administer a test. Each 
teacher, of course, constructs the test around his or her own 
definition of problem solving. Each takes the test draft to the 
secretarial pool for typing, duplicating, and collating.) 

ACT III 

Scene 1 

(The secretarial pool consists of three people. The climate in 
the crowded room is not pleasant. The "new girl" was late 
again, and appears to have missed considerable sleep. Mary- 
belle, the new girl, had in fact missed more than just a little 
sleep — but it had been worth it! Now those two old fools had 
dumped three tests on her to type. All those damn numbers! 
Why do schools need arithmetic anyway?) 

Marybelle: This one belongs to Mrs. G, right? 




Secretary No. 1: Obelgoodch. (Without looking up.) Mrs. G is 
such a wonderful teacher. We need more like her. Why 
my Sarah, when she had Mrs. G . . . . 

Marybelle: That's what I thought. Mrs. G certainly seems to 
like to have her students play math games. And this one 
— the one that looks just like the big test the kids take 
this month — this is the one from Mr. S? 

Secretary No. 2: (Again, without looking up.) I suppose. You 
can always tell them by the handwriting, can't you? No? 
Well, after a few years you'll be able to. 

Marybelle: (Cringing at the implications of "in a few years" in 
the last comment.) Maybe. Then this one with all the 
talk about poverty and economic conditions and so forth 
must be from Miss H. 

(To eliminate further confusion, Marybelle types the "correct" 
names with the tests. She types "Mrs. G" on the test covering 
many math games, "Mr. S" is linked to the test which looks 
like it is straight from a standardized battery, and "Miss H" 
becomes the heading for the "socially relevant" test. Marybelle 
types the tests, duplicates them, collates thirty-five copies of 
each, places the copies in envelopes, and takes them to the 
appropriate rooms.) 

ACT III 

Scene 2 

(The location is the teacher's lounge. The three substitutes are 
correcting the tests. Not all substitutes correct tests, but these 
are regular substitutes — experienced teachers who are not 
working full-time. Each likes to be part of the class on the days 
she is with them. Each is about half way through her stack of 
tests. Let's pick up the conversation.) 

Sub. for 

Mrs. G: I always hear that Mrs. G is such a good teacher. 

Sub. for 

Miss H: Of course she is. And after looking at these tests from 
Miss H's students, I'm going to pull every string to make 
sure my Cloe gets Mrs. G! 




Sub. for 

Mrs. G: But these tests are terrible! Mrs. G uses words that the 
students don't even know. It looks like they tried to play 
games that the students couldn't understand, and Mrs. G 
didn't even know it. No one has gotten more than ten of 
the fifty correct yet. 

Sub. for 

Mr. S: Mr. S had 45 questions and the highest score so far is nine. 
And these questions are just like the ones that will be on 
the achievement tests this month. Oh my! 

Sub. for 

Miss H: Gee, and I thought that these were bad. These students 
don't seem to know anything about the social issues. And 
she's so young. You'd think she could do better than that. 

Sub. for 

Mr. S: I think it must be that new math book they bought. I told 
Twink it was no good when he bought it. He should have 
listened to me. (She and her husband are social acquaint- 
ances of the Twinkys, and she lets the others know, 
whenever possible, that he is more than just an employer 
to her.) 

Sub. for 

Miss H: I think we should show these results to the Principal. I 
wonder if he knows that our ten-year-olds know virtually 
nothing about problem solving. And after Dr. Twinky 
made a special point of this in his State of the District 
speech. 

Sub. for 

Mr. S: (Looking up.) We won't have to. Here comes Dr. Twinky 
now. (Calling.) Oh, Dr. Twinky, would you come over 
here for a minute? 

EPILOGUE 

Suppose you write this part yourself. Your epilogue could func- 
tion as a sort of a projective test. A tragic ending, a sad ending, a 
noncommittal ending, a bittersweet ending, a happy ending — each 
of these could tell us something about your own personality. 




Questions 

1. Which definition of problem solving appeals to you the most: Miss H 
and the games; Mr. S and the socially relevant approach; or Mrs. G 
and her standardized test problems? Defend your choice in one or 
two paragraphs. 

2. Does "problem solving in arithmetic" mean anything else to you 
besides the interpretations given by the three teachers? 

3. The students took the "wrong" tests. That is, each class took a test 
not written by its teacher. Now the teachers are being criticized for 
the low performance of their students. Is this fair to the teachers? Do 
you think they are deserving of criticism? 

4. The students were probably pretty frustrated by the tests they took.
Although they took the "wrong" test, the fact is, they did uniformly
badly. Have the teachers been fair to the students by not having a
common interpretation of problem solving?

5. Suggest one or two ways that the whole fiasco might have been 
avoided. 



The play will probably do poorly in next year's Pulitzer Prize 
competition. The fundamental concept involved should be carefully 
considered by anyone who is not familiar with the fuel that feeds 
the behavioral objectives fire. Why have so many jumped on the 
behavioral objectives bandwagon? The answer is in the little play. 

The key is the need for unambiguous communication — precision 
of language, avoidance of misunderstanding. Dr. Twinky exhorted 
his staff to concentrate on problem solving in arithmetic. The characters
were overdone; the plot was superficial; but the overall result
was not unrealistic. That is, whenever ambiguity exists in the 
meaning of the statement of a goal or direction, the respondents 
will interpret the statement in varying manners. Each player in the 
farce was sure that his or her private interpretation of problem 
solving was the correct one. Had you been teaching ten-year-olds, 
you may have had something to add to all of their interpretations. 

Look at the statements below. Can you recognize any as state- 
ments you have heard — or made? 

"It's important for us in this school to have the child develop a 
good attitude toward himself." 




"How can our graduates succeed without basic understanding of 
the way our economic system works? We try to stress economic 
education here." 

"It isn't enough to learn a bunch of tricks and facts. We try to 
instill a basic understanding of the structure of mathematics. This 
will allow the child to fit all kinds of mathematical problems into 
one framework — present day problems and some we don't even 
know about yet." 

"The core about which our curriculum revolves is good citizen- 
ship. The goal is for our graduates to be good citizens." 

"Meet the child where he is. You know, teach him according to 
his needs and abilities." 

Have you heard any of those before? Was it from an education 
professor? in a professional journal? from a building principal? from 
a parent? a friend? The statements are not the kind of thing one 
immediately writes off as the ravings of a wild man. They reflect
the kinds of statements that principals, school superintendents, school
board presidents, teachers, curriculum directors, and counselors
make. Think of the misinterpretations you found in the little play.
Keep those misinterpretations in mind. Now read through the five 
general statements again. 

Do the statements communicate unambiguously? 

Take the first statement: "It's important for us in this school to 
have the child develop a good attitude toward himself." How would 
you decide if a child has a good attitude toward himself? Think of a 
person you know who is very, very moralistic. Everyone knows a 
person like that. How would that person decide if a child has a good 
attitude toward himself? Do you know someone who is considered 
to be amoral? How would that person work toward having a child 
develop a "good attitude toward himself"? 

Another play is developing! The moralist and the amoralist may 
define two extremes for responses — perhaps yours is somewhere in 
between. The teaching strategies for a person at one extreme will 
probably differ significantly from those of a teacher at the other end 
of the morality continuum. The statement about "developing a 
good attitude toward himself" must be ambiguous. One wonders if 
the person who originally wrote the objective could possibly agree 
with the different teaching strategies and educational program 
which come from the two extreme interpretations. The probability 
of one person agreeing with widely disparate views on the same 
topic seems remote. 




Questions 

6. Do people of different political persuasions differ on what they think 
the economic policies of this country should be? Think of three differ- 
ent interpretations which might be made of the second statement 
regarding economic education. 

7. List four behaviors that you think exemplify a good citizen. Be pre- 
pared to give these in class, to see how your interpretation compares 
with that of others. 

8. How would you suggest that a teacher find out what a child's "needs 
and abilities" really are? 



Of course, when one communicates in an unambiguous manner 
he opens himself to criticism. People understand precisely what the 
communicator is talking about. Sometimes this is unfortunate for 
the communicator. Safety can be found in hiding behind a general 
statement wherein many widely disparate philosophies can all find 
agreement and support. 

Take Dr. Twinky, for example. He might simply have said: "I 
want you to stress mathematical problem solving this year because 
the test scores in this area were low. I suggest that we study the 
kinds of problems found in these tests and work hard to insure that 
our students have the requisite skills for solving problems of this 
nature." Why didn't he simply say that? The answer was in the 
play. He did not want anyone to think that he was allowing the 
standardized testing program to dictate his curriculum; and stan- 
dardized tests should not dictate the instructional program of a 
school. The school's program is much bigger and broader than any 
standardized achievement testing program. 

This is not the place for a long discussion regarding concepts 
which are, and are not, important areas of problem solving in an 
elementary school mathematics program. The point is this: 

If . . . he is sincere in the desire to make sure that the students
in his school can solve mathematical problems (like those
found in the achievement test) at a satisfactory level . . .

Then . . . wouldn't the most logical step be to make sure that
all of those who would be implementing that desire would
be working toward the same specific objectives?

In other words, he should have translated that one general state- 
ment into a series of very specific, expected behaviors. Perhaps he 
would not be the translator. This might be done by a committee of 
teachers, or the mathematics curriculum director, or through an 
abstraction of the work of a national committee (like the Commit- 
tee to Assess the Progress of Education, sometimes called the 
National Assessment Committee). The point is that the task is not 
impossible and could have been done. One of the primary vehicles 
for translating from general to specific is through the use of behav- 
ioral objectives. By the end of the next chapter, you should be able 
to participate in such a project because you should then be able 
to state general concepts in behavioral language. 

So the concept "improve problem solving in mathematics" could 
be translated to behavioral language. Should it be? The general 
statement about "understanding how our economic system works" 
will undoubtedly be translated in very diverse manners by those of 
different political persuasions. Is that so bad? Is it fair to the 
students when these general statements are interpreted for them? 
Is it fair to the students to allow interpretations to vary? To be 
sure, some interpretations will be more right than others. Is it fair 
to a teacher to take away his right to interpret each general concept 
in his own way? Conversely, is it fair to not be very explicit, only to 
be critical at a later time of the teacher's interpretation? 

Some real issues may be bothering you at this time. Can all very 
real, important and general objectives be translated to behavioral 
language? How about the relationship between consent to or agree- 
ment with the specific objectives on the one hand and evaluation 
of performance on the other? That is, what if the teacher or student 
does not subscribe to the specific interpretation of the objectives? 
Will certain teachers or students suffer in the evaluation due to an 
honest disagreement? 

Those are just some of the questions people often raise with the 
behavioral objectivists. We will put the issues aside temporarily, to 
be picked up again in the third chapter. Put the doubts and ques- 
tions out of your mind until then, for it will be much easier to 
discuss the issues after you have completed chapter 2, where you 
should learn to construct behavioral objectives. 




Questions 

9. As a warmup for chapter 3, take one of the issues raised in the last 
few paragraphs and present your point of view. The issues include: 

a. Should all general objectives be stated in clear and unambiguous 
language (assuming it is possible to do so) ? 

b. Is it fair to the students when one specific interpretation is agreed 
upon? 

c. Is it fair to the students not to agree to one specific interpretation? 

d. Is it fair to the teachers when one specific interpretation is agreed 
upon? 

e. Is it fair to the teachers not to agree on one specific interpretation? 

f. Is it fair to evaluate someone's performance related to some 
specific interpretation of an objective if the person does not agree 
with the interpretation? 

Try to argue both sides of these questions. Cite hypothetical ex- 
amples. 



Thinking Positively About Behavioral Objectives and
Learning to Write Them



Remember our exhortation at the end of the previous chapter: 
Think positively about behavioral objectives — at least until chapter 
3. In this chapter you should learn either how to construct a behav- 
ioral objective or to evaluate an objective which has been con- 
structed by someone else. You will be in a much better position to 
separate the positive and negative sides of the behavioral objectives 
issue when you have learned the technical considerations governing 
their construction. 

This chapter is designed to be self-instructional. To use this 
chapter effectively, you cannot be a passive receiver of information, 
as you are with so much that you read. Throughout the chapter 
you will be asked to respond to questions, and your responses will 
be an integral part of the instructional process. Don't just skip by 
them! Make an explicit, written response to each question before 
you sneak a peek at the answer which is given.

The first task, it seems, should be to contrast behavioral objec- 
tives with their nonbehavioral counterparts. Before this is done, 
however . . . not meaning to nag . . . but . . . 

DO YOU HAVE A PENCIL AND PAPER IN FRONT OF 
YOU READY FOR YOUR RESPONSES TO THE QUESTIONS 
TO BE ASKED IN THIS CHAPTER? 

In the little play, Dr. Twinky urged his teachers to pay more
attention to mathematical problem solving. Three teachers interpreted
his remarks in quite different manners. The key is specificity.
A behavioral objective is specific. A nonbehavioral objective
is not. Here is a behavioral objective:

Given a 200-word news article chosen from the front page of 
the local newspaper, the eighth-grade student should be able 
to read the article out loud to his class.¹

Most teachers would interpret that objective in the same manner. 
That is, most teachers would choose 200-word articles and have 
each student go to the front of the class and read the article. 

The first question for your poised pencil and paper: Translate 
the behavioral statement about the news article into a nonbehav- 
ioral one. That is, translate it into a statement which is ambiguous 
such that it would probably be interpreted in markedly different 
manners by different teachers. (After each of the questions a rule 
will appear. The rule functions as an eye barrier. Try not to look 
past the rule until you have made your response in writing to the 
question. Commit yourself to an answer before looking at the one 
given. You'll learn better if you do.) 



Here are some possibilities: "Students should learn to read news- 
papers." "Reading out loud is important." "Young kids need to be 
aware of current events." The behavioral objective which was given 
could have been spawned by any of these three general statements. 
Each of these three is nonbehavioral. Each is ambiguous and would 
be interpreted in highly variable manners. 

Reviewing then, these two points have been made: 

1. Behavioral objectives are specific. Nonbehavioral objectives 
are not specific. 

2. A behaviorally stated objective will be interpreted in nearly 
the same manner by all who use it. A nonbehaviorally stated 
objective, being ambiguous, leads to highly variable interpre- 
tations. 

Marybelle, the tired secretary, managed to raise havoc with the
math problem-solving tests. When the tests were given to the wrong
classes the classes did poorly. Another important manner for distinguishing
between behavioral and nonbehavioral objectives is
suggested: Most people would evaluate a behaviorally stated objective
in the same manner, whereas the evaluation of a nonbehaviorally
stated objective could vary widely.

1. Note that no evaluative statement has been made regarding his performance
at this task. This has been skipped because we do not want to get
bogged down right here.

Consider this objective: Fifth-grade students should know about 
the metric system. 

Teacher number 1 sets out some objects, makes available a meter 
stick, a one liter container, and a balance beam using metric mea- 
sures, and asks the students to make measurements in metric units. 
Teacher number 1 interprets "know" to mean "use." The students 
"know" about metric units when they can use metric measuring 
devices to make measurements. 

Think of two different ways that other teachers might evaluate 
the objective "fifth-grade students should know about the metric 
system." Teacher number 1 interpreted the statement to mean 
"use with real objects." How would others evaluate to see if stu- 
dents "knew" about the metric system? Write down two other 
ways. 



Many answers are possible. For example, you might have chosen: 
(a) Ask the students questions which require converting from one 
unit to another within the metric system, (b) Require the students 
to convert units of weight, length, and volume from metric to 
English units, (c) Conversely, ask the students questions which 
require converting units of weight, length, and volume from English 
to metric units, (d) Ask students to estimate the height, weight, 
and volume of various objects in metric units. 

I do. I did. I have done. Doing — there's the key to stating some- 
thing in behavioral terms. What will the learner be doing when he 
reaches the objective? If you and everyone else are in pretty close 
agreement on what the learner will be doing when the objective is 
attained, the objective is in behavioral language. If there is dis- 
agreement, the objective is not specific enough. 

Write the numbers 1 through 14 vertically on your paper so you 
can answer yes or no to each of the following questions. For each 
of the fourteen short statements below, answer yes if the statement 
tells quite clearly what some learner will be doing. Write no if you 
think the statement does not communicate this information un- 
ambiguously. 




1. grasp the significance of 

2. list 

3. gain insight into 

4. appreciate 

5. define 

6. have a knowledge of 

7. contrast 

8. solve 

9. identify 

10. fully understand 

11. construct 

12. explain 

13. know 

14. compare 



What is a learner doing when he is "grasping the significance of" 
some concept? Would one wait for the learner to smile, sigh, or say 
"ahah"? Would most teachers evaluate the act of "grasping the 
significance of" in the same manner? "The learner will grasp the 
significance of the decimal system." The statement is clearly an 
ambiguous one. The first answer on your paper should be "no." 
Other "no" answers should be marked for numbers 3, 4, 6, 10, 
and 13. 

The second verb is "list." The meaning is fairly clear. The 
learner will be making some sort of list. Of course, to be completely 
unambiguous one needs to state how the learner will be making the 
list (oral or written), what aids he will have in his work, and what 
we expect to see on the list. The difference between "list" and 
"grasping the significance of" seems clear. Most people will imme- 
diately be in agreement on what is involved with the making of the 
list. This is not true with "grasping the significance of."

"Gain insight into," "appreciate," "have a knowledge of," "fully
understand," and "know" are in the same category as "grasp the
significance of." Each suffers from a lack of clarity, a lack of commonly held
meaning. Different teachers would teach toward them in varying
manners. Different evaluators would measure their attainment with
dissimilar instruments. Each verb fails to communicate without
ambiguity what the learner will be doing when he reaches the objective.




On the other hand, "list," "define," "contrast," "solve," "iden- 
tify," "construct," "explain," and "compare" do communicate; 
although each is somewhat ambiguous standing alone, since the 
content, method, and conditions have not been included. Each of 
the verbs has a relatively precise meaning and describes a fairly 
specific behavior on the part of the learner. Each tells what the 
learner will be doing when he reaches the objective. 
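
For readers who like to see a rule made mechanical, here is a minimal Python sketch of the yes/no exercise above. The two verb lists come straight from the fourteen items; the function name tells_what_learner_does and the "unlisted" fallback are assumptions made only for illustration. A lookup table cannot judge a new verb. It simply records the judgments already made in the text.

```python
# Verb lists copied from the fourteen-item exercise; everything else
# (names, the "unlisted" fallback) is an illustrative assumption.
BEHAVIORAL_VERBS = {
    "list", "define", "contrast", "solve",
    "identify", "construct", "explain", "compare",
}
AMBIGUOUS_VERBS = {
    "grasp the significance of", "gain insight into", "appreciate",
    "have a knowledge of", "fully understand", "know",
}


def tells_what_learner_does(verb_phrase):
    """Return "yes", "no", or "unlisted" for a verb phrase."""
    phrase = " ".join(verb_phrase.lower().split())  # normalize case and spacing
    if phrase in BEHAVIORAL_VERBS:
        return "yes"
    if phrase in AMBIGUOUS_VERBS:
        return "no"
    return "unlisted"


EXERCISE = [
    "grasp the significance of", "list", "gain insight into", "appreciate",
    "define", "have a knowledge of", "contrast", "solve", "identify",
    "fully understand", "construct", "explain", "know", "compare",
]

answers = [tells_what_learner_does(verb) for verb in EXERCISE]
no_items = [number for number, answer in enumerate(answers, start=1) if answer == "no"]
print(no_items)  # the item numbers that should be marked "no"
```

The printed item numbers should match the answer key given above. Note that even a "yes" verb is only a start: content, method, and conditions still have to be added.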

Suppose you try the same sort of activity again. This time the 
questions will be based on entire sentences rather than just verbs. 
Write the numbers 1 through 6 on your page. Answer yes if you 
think the sentence communicates in a fairly unambiguous manner 
what a learner will be doing when he reaches the objective, and no 
otherwise. If you are in doubt, ask yourself, "Would many different 
teachers teach toward this objective in the same manner?" and 
"Would many different teachers evaluate the attainment of this 
objective in the same manner?" 

1. The student will understand how the library catalog system 
works. 

2. The students will acquire an appreciation of the contributions 
of minority cultures in America. 

3. The learner will be able to explain orally how the combustion 
engine works. 

4. The students will have a complete knowledge of the basic 
arithmetic skills. 

5. The learner will be able to define in words what is meant by 
"manifest destiny." 

6. The students will choose from a list of nineteenth-century 
Romantic composers those of non-German nationality. 



This time you should have no responses on the first, second, and 
fourth sentences; and yes responses for the third, fifth, and sixth. 
As any librarian will tell you, the library catalog system is very 
complex. One person's interpretation of "understanding the sys- 
tem" might simply be "the ability to locate a book, given a call 
number." Another teacher might ask for the recall of the category 
names associated with each number in the system. Hundreds of 
other interpretations are possible. The objective fails to communicate
unambiguously. No general agreement would exist regarding
what the student will be doing when he reaches the objective. The
same general argument can be used to defend the no responses to 
the second and fourth statements. 

The other three statements ask students to "explain orally," 
"define in words," and "choose from a list." Each communicates in 
a generally understood and agreed upon manner. Yet even these 
would benefit by some additional explanation. For example, in
statement three, "the combustion engine" probably should be
further defined, since many different kinds of internal combustion
engines exist.

Reviewing once again, the following points have been made: 

1. A behavioral objective is characterized by specificity and lack 
of ambiguity. 

2. If an objective is stated in behavioral terms, it will be inter- 
preted by most people in the same manner. 

3. If an objective is stated behaviorally, it will be evaluated by
most people in the same manner.

4. A behavioral objective tells what the learner will be doing 
when he reaches the objective. 

Where are we going? How shall we get there? How will we know 
when we get there? The goal question . . . The instructional strategy 
question . . . The evaluation question. Respond in writing to this 
question: Which of the three questions listed above are answered 
by the following objective: The student shall learn to convert 
decimal fractions to percents and show this skill by scoring 90 per- 
cent or more correct on a mastery test approved by a committee 
consisting of mathematics teachers. 



The goal part includes everything up to the word "percents." 
Everything after "percents" is part of the evaluation statement. 
The objective says nothing about the instructional strategy. A 
behavioral objective is an attempt to state the expected outcome 
and the evaluation technique in unambiguous language. Behavioral 
objectives do not usually allude to the instructional strategy. 

Just to make sure you understand this idea, complete the following
exercise on paper. For each of the objectives listed below,
answer: (a) does it contain a statement of both the goal and the
evaluation of the goal; (b) does it contain a statement of goal only;
or (c) does it contain a statement of evaluation only?

1. A ninth-grade geometry student should be able to correctly 
write at least 30 out of 35 proofs on the final test provided by 
the textbook publisher. 

2. To improve penmanship skills of fourth-grade students. 

3. The students should acquire in high school Driver's Education 
class a knowledge of the fundamental rules of the road so that 
90 percent of the students pass the written state qualifying 
test for a permit. 

4. To develop in sixth-grade students the ability to read longi- 
tude and latitude on an atlas or globe. 

5. In a boys' eleventh-grade swimming class: To develop breast 
stroke performance so that 85 percent of the students can 
swim the length of the pool in less than 3 minutes. 

6. To develop in seventh-grade English students the skill of 
diagramming sentences so that at least 75 percent of them 
responding to a teacher-made written test can correctly dia- 
gram 12 out of 15 sentences. 

7. To provide exposure to basic office tasks and development of 
these tasks in a high school business class. 

8. To improve a cappella singing in junior high choral classes.



The list above contains four complete objectives, four statements 
of goal only, and no statements of evaluation only. The complete 
statements are numbers one, three, five, and six; and the goal-only 
statements are the remaining four. If you are having any trouble 
at this point, go back and read through the eight statements again. 
Try to compare the four which are statements of goals only to the 
four which are complete. 
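
The sorting you just did can also be sketched in Python. The class Objective, the function classify, and the short paraphrases of the eight statements are all assumptions made for illustration; the way each statement is split into a goal and an evaluation simply follows the answer key above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Objective:
    goal: Optional[str]        # the expected outcome, if one is stated
    evaluation: Optional[str]  # how attainment will be judged, if stated


def classify(objective):
    """Sort an objective into the three categories used in the exercise."""
    if objective.goal and objective.evaluation:
        return "goal and evaluation"
    if objective.goal:
        return "goal only"
    if objective.evaluation:
        return "evaluation only"
    return "empty"


# Paraphrases of the eight statements (hypothetical wording).
STATEMENTS = {
    1: Objective("write geometry proofs", "at least 30 of 35 correct on the publisher's final test"),
    2: Objective("improve penmanship skills", None),
    3: Objective("know the rules of the road", "90 percent pass the written state test"),
    4: Objective("read longitude and latitude", None),
    5: Objective("develop breast stroke performance", "85 percent swim the pool in under 3 minutes"),
    6: Objective("diagram sentences", "75 percent diagram 12 of 15 correctly on a teacher-made test"),
    7: Objective("exposure to basic office tasks", None),
    8: Objective("improve a cappella singing", None),
}

complete = [n for n, o in STATEMENTS.items() if classify(o) == "goal and evaluation"]
goal_only = [n for n, o in STATEMENTS.items() if classify(o) == "goal only"]
print(complete, goal_only)
```

The hard part, of course, is the human judgment that fills in the two fields. The code only makes the bookkeeping explicit.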

At this point, you should be getting a fairly clear picture of the 
distinction between behavioral objectives and those not stated in 
behavioral terms. The next task is to refine your objective writing 
abilities with a few technical suggestions. To start off, why don't
you write down your definition of behavior? What is a behavior? A
gold-plated psychological definition is not required. Just give your
definition, in your own words.



Do you know what the word perceive means? It means "to have 
knowledge of through any of the senses." One good working defini- 
tion of a behavior uses the word perceive: A behavior is any per- 
ceivable activity engaged in by the person you are observing. So if 
the person you are observing is doing something, and you are able 
to have knowledge of it through one of your senses, whatever it is 
he is doing is called a behavior. 

Next, consider the notion of terminal behavior. To terminate 
some act means to end it; thus terminal behavior implies the ending 
behavior. When a behavioral objective is used in the sense of a 
contract or of a prediction, the terminal behavior is that perform- 
ance expected of the learner at the end of the period of instruction. 
It's the behavior you expect him to show you. Two points regarding 
the concept of terminal behavior should be made. 

First, as mentioned earlier, the objective states the terminal 
behavior expected, but not the method whereby the teacher reaches 
that behavior. Your district's curriculum guide or the teachers' 
edition of the textbook in question might suggest instructional 
strategies, but the behaviorally stated objective usually speaks only 
to the goal and the evaluation standards. 

Second, the learning of a particular concept can involve a whole 
series of sequential or hierarchical terminal behaviors. Later in this 
chapter a more complicated example from a social studies course 
will be given, but for now, consider this example from elementary 
school mathematics: 

The student shall be able to multiply (using only pencil and paper) 
any five-digit number (including up to three decimal positions) by 
any four-digit number (including up to two decimal positions). 
The student shall be able to complete at least 80 percent of such 
problems without error. 

This is a complex objective. A whole series of bridges need to be 
crossed by the student before he can possibly reach the terminal 
behavior expressed. To start with, he needs to recall the arithmetic
"facts" (6 × 4, 7 × 9, and so forth). Even before that, though,
he needs to recognize that the numeral "3" symbolizes a grouping 
of three objects, and the meaning of the other numerals. Write 
down a prior behavioral objective — one which probably would need 
to be accomplished before the student moves on to the multiplica- 
tion of five- and four-digit numbers, both of which may involve a 
decimal. Try to write this prior objective in behavioral terms. 



You might have changed the number of digits in one of the 
numbers. For example, perhaps you altered the "five" and/or the
"four" to some smaller value. You might have decided that multiplication
of numbers like this without decimal points is a necessary
precondition for multiplying with decimal values. Maybe you 
started back with two one-digit numbers which involved decimals. 
Check your objective for its "behavioralness." Would most teachers 
interpret in the same manner? Would the evaluation efforts of most 
teachers be similar? Does it have a statement of a goal and a state- 
ment of the evaluation of that goal? Does it tell what the student 
will be doing when he reaches the objective? 

The topic is terminal behavior, but do not interpret this to mean 
that every learning sequence has only a single behaviorally stated 
terminal behavior, or that the behavioral statement will catalog 
each and every prior step. The objective regarding multiplication 
of five- and four-digit numbers involving decimals could (and 
probably should) be dissected much further into the series of 
sequential and hierarchical tasks which lead to the final terminal 
behavior. 
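
One way to see how specific the multiplication objective is: its evaluation can be written out as a short program. The Python sketch below is only an illustration. It assumes that "five-digit number (including up to three decimal positions)" means five digits in all, with up to three of them after the decimal point; the function names random_factor and meets_criterion are made up for this example.

```python
import random
from decimal import Decimal


def random_factor(total_digits, max_decimals, rng):
    """Return a number with `total_digits` digits, 0..max_decimals after the point."""
    digits = rng.randrange(10 ** (total_digits - 1), 10 ** total_digits)
    places = rng.randint(0, max_decimals)
    return Decimal(digits).scaleb(-places)  # e.g. 12345 with 3 places -> 12.345


def meets_criterion(problems, answers, required=Decimal("0.80")):
    """True if the share of error-free products is at least `required`."""
    correct = sum(1 for (a, b), answer in zip(problems, answers) if answer == a * b)
    return Decimal(correct) / len(problems) >= required


rng = random.Random(0)  # fixed seed so the example is repeatable
problems = [(random_factor(5, 3, rng), random_factor(4, 2, rng)) for _ in range(10)]
perfect = [a * b for a, b in problems]
eight_right = perfect[:8] + [Decimal(0)] * 2  # last two answers wrong
seven_right = perfect[:7] + [Decimal(0)] * 3  # last three answers wrong

print(meets_criterion(problems, eight_right))  # 8 of 10 is exactly 80 percent
print(meets_criterion(problems, seven_right))  # 7 of 10 falls short
```

Try writing a similar check for "the student will understand multiplication." You cannot, and that is the point of stating the objective behaviorally.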

Behavior has been defined. You now should recognize the mean- 
ing of terminal behavior and that a learning sequence may include 
a whole series of terminal behaviors. The next technical issue is 
that of the statement of conditions under which the behavior is to 
be manifested. Consider this objective: 

Given raw data, the student shall be able to compute the value of 
a product-moment correlation coefficient. 

Suppose you were a student in the class and the instructor just
announced the objective written above. Do any questions come to
your mind about the conditions under which you will be expected
to compute this correlation coefficient? Try to write one question
regarding conditions on your paper.



Now if you have never been introduced to a correlation coefficient 
the task was probably difficult. But if you have had even a minor 
amount of statistical training, your first question was probably, 
"Will the formula be provided?" The formula is a fairly long and 
complicated one. Actually, the "Given raw data" was a condition 
already in the original statement. The instructor may now change 
the objective to read: 

Given raw data and the computing formula, the student shall be 
able to ... . 

Other questions come to mind. The formula requires the computa- 
tion of square roots. "Will we have a square root table available?" 
The computations are tedious. "How many will be expected to be 
completed in an hour?" Everyone makes mistakes, especially when 
doing computations by hand. "Are we expected to get all questions 
correct?" "Will we have an adding machine, desk calculator, or 
computer to work with?" 
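
To see why these questions about conditions matter, it helps to look at the computation itself. The text does not print the computing formula, so the Python sketch below assumes the usual raw-score form of the product-moment coefficient: r = (N*SUM(XY) - SUM(X)*SUM(Y)) divided by the square root of [N*SUM(X^2) - (SUM(X))^2] * [N*SUM(Y^2) - (SUM(Y))^2]. The function name product_moment_r and the sample scores are made up for illustration.

```python
import math


def product_moment_r(x, y):
    """Pearson product-moment correlation from raw scores (raw-score formula)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be the same length, with at least two pairs")

    # The five sums a student would build up by hand or on a desk calculator.
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    numerator = n * sum_xy - sum_x * sum_y
    # This is the square root the text mentions.
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either variable has no variance")
    return numerator / denominator


# Hypothetical raw data: every y score is exactly twice the x score.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 6, 8, 10]
print(product_moment_r(x_scores, y_scores))
```

Five running sums, three multiplications, and a square root for even a tiny data set: a student doing this by hand has every reason to ask whether a table or a machine will be available.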

Of course, you can get carried away with these statements of 
conditions. "Given a standard school desk in a classroom of 32' by 
40' between 11:15 and 11:45 a.m. on a Tuesday with the room 
temperature not deviating by more than 4° from 74°F., the student 
shall be able to. . . ." State only those conditions which impinge in 
an obvious way on the manner in which the learner will show you 
whether or not he has reached the objective. Here are some examples
of what is meant by "impinge in an obvious way":

Given the use of mathematical tables . . . 
Using Chapter 6 of your book as a guide . . . 
Given a list of 10 irregular verbs . . . 
Without the use of a dictionary . . . 
Without the use of notes . . .
Using only the given chart . . . 




Look at the objective below. List on your paper the part or parts 
of the objective which are statements of conditions. 

Given a list of the fifty states, the student should be able to list
the name of at least one of each state's two U.S. senators, without
the use of books or references.



". . . the student should be able to list the name . . ." is the 
statement of the goal. The part "of at least one of each state's 
two U.S. senators, . . ." is the criterion — a term which will be dis- 
cussed shortly. The other two parts — "Given a list of the fifty 
states, . . ." and "without the use of books or references . . ." are 
both statements of conditions. 
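
The dissection just made can be written down as a small record with three fields. This Python sketch is only one way to do the bookkeeping; the class name BehavioralObjective and the method missing_parts are assumptions made for illustration, not terms used elsewhere in this book.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BehavioralObjective:
    goal: str                     # what the learner will be doing
    criterion: str = ""           # the standard the performance must meet
    conditions: List[str] = field(default_factory=list)  # circumstances of the performance

    def missing_parts(self):
        """Name the parts that have been left blank."""
        missing = []
        if not self.goal:
            missing.append("goal")
        if not self.criterion:
            missing.append("criterion")
        if not self.conditions:
            missing.append("conditions")
        return missing


senators = BehavioralObjective(
    goal="the student should be able to list the name",
    criterion="of at least one of each state's two U.S. senators",
    conditions=[
        "Given a list of the fifty states",
        "without the use of books or references",
    ],
)

vague = BehavioralObjective(goal="know about the metric system")

print(senators.missing_parts())  # nothing is missing
print(vague.missing_parts())     # no criterion and no conditions stated
```

A blank field is a flag for the writer, not proof of a bad objective. Whether a stated condition is reasonable is a separate question that no record can answer.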

Tossing a condition or two into each behavioral objective you 
write is not tough at all. If you order a steak at a restaurant,
the chef will probably toss a nice piece of fresh green parsley on the
plate before it is served to you. The conditions are not like the
parsley. The parsley is not an integral part of the meal (except for 
those aesthetically sensitive); but the conditions are an integral 
part of the objective. They must be reasonable. You must be pre- 
pared to defend the conditions which you impose as being har- 
monious with the overall goals of the instructional program. The 
conditions will be closely related to the instructional strategy used 
to reach the objective. The objective, you recall, gives only the goal 
and the type of evaluation which will be used. No mention of the 
instructional strategy will be made directly in the objective. But 
the instructional strategy and the overall goals of the instructional 
program must be kept in mind when imposing the conditions.

Why, for example, should the student recall the name of even 
one of the senators from each state? How does having this informa- 
tion immediately available in memory further the instructional 
program? Why not have the students recall the names of both 
senators? Why not have them match the names of senators with 
the names of the 50 states? For that matter, why not have them 
recall the names of the 50 states as well! 

The example is too superficial. Consider this objective from an 
educational psychology course for sophomores: 



Without the use of notes or books, the student shall recall the 
definitions of and give an example of the following terms: rein- 
forcer, punisher, extinction, reinforcement, punishment, time out. 

"Why must I recall these? I'll just forget them soon!" A logical 
student question and one which is deserving of an honest reply. 
How does having the student recall these definitions fit into the 
total instructional program? Couldn't one give the student a 
duplicated list of terms and definitions for use as needed in the 
future? Perhaps the student could be encouraged to keep his text- 
book at the completion of the course for use later when a definition 
is required. 

The instructor who established the recall condition on the objec- 
tive should be prepared to justify it. Such justification might hinge 
on one of these thoughts: The objective is a means to an important 
end; it is an important end; or it is both. 

Assuming that the course in question has a marked "behaviorist" 
flavor, it can be assumed that these terms will occur frequently in 
lectures, discussions, and questions. The instructor might reason 
that the student must have the definitions immediately in memory 
as a means to an end. That is, the ability to instantly recall the 
definitions will be a necessary condition if the student is to take a 
meaningful part in the lectures and discussions, and ask meaningful 
questions. Another instructor might argue that knowledge of the 
definitions of these terms (in a recall format) is an important end 
in and of itself. 

The student has every right to know the conditions under which 
his performance will be evaluated. And this information should be 
known in advance of the actual evaluation. This need is shared by 
students at all levels — from graduate school to kindergarten. If the 
students know the conditions in advance, they will be able to 
challenge the instructors to justify the conditions which are im- 
posed. "Why must this be memorized?" "Why can't I just use a 
slide rule?" "Why can't I write my report instead of presenting it 
orally?" In short, the conditions must not be capricious. They must 
be reasonable in the face of the instructional strategies and overall 
objectives of the program. Before you establish a condition, be sure 
that you will be able to justify it to the learners involved. 

Before the next topic is introduced, a brief review seems appro- 
priate. Answer these on your paper: 



1. For each of the following, write down the part of the statement which 
describes the conditions. 

a. The student will write a 200-word essay on a topic provided by 
the teacher. 

b. The student will be able to solve the following type of problem: 

2x + 3y = 17 
x - 4y = -6

c. Using only his tool kit and general instruction manual, the learner 
will be able to adjust and measure the spark plug gap on any car. 

d. The students will be able to name all the bones of the leg and foot 
with 90 percent accuracy. 

2. When a teacher imposes a condition of performance on an objective 
he should be able to justify that condition in at least one of two ways. 
What are the two ways? 

3. "The learning of one concept may include a series of terminal 
behaviors." Explain what this means. 



Question 1-a has a pretty obvious condition imposed. The condi- 
tion is "on a topic provided by the teacher." If the student provides 
the topic, the condition is changed in a marked manner. The second 
and fourth parts (1-b and 1-d) have no statement of condition 
included. What you might have taken for a condition in 1-d ("with 
90 percent accuracy") will soon be introduced as a minimum 
acceptance level. The condition in 1-c is the first phrase up to the 
word "manual." For the second question, the teacher should be 
able to justify the condition either as a means to an end or as an 
important end in itself. Finally, a complex learning task may re- 
quire the specification of a whole series of sequential behavioral 
objectives, all of which would include a terminal behavior. 

The behavior . . . The terminal behavior . . . The conditions 
under which the terminal behavior will be manifested. The task is 
to learn how to construct behavioral objectives; and the goal part 
has now been covered. The next task is the evaluation part of the 
objective, and here the terms criterion measure (sometimes just 
shorthanded "criterion") and minimum acceptance level become 
the focus of attention. 

The "behavioralness" of a behavioral objective may be checked 
by answering the question, "Would most teachers evaluate the
attainment of the objective in approximately the same manner?"
Remember from chapter 1 the poor performance by the three 
teachers' students when Marybelle inadvertently mixed up the 
three tests. The goal and conditions parts of the behavioral objec- 
tive need to be stated in specific and unambiguous language. The 
manner in which the attainment of the objective will be evaluated 
must also be very specific. 

By naming the criterion measure, you are identifying your 
instrumentation plan. An evaluation "instrument" could be a 
teacher-made test, a standardized test, an observation of the stu- 
dent performing some act, some sort of oral statement — in short, 
the student behavior which can be legitimately viewed as indicating 
the attainment of the objective's goal. Some examples of criterion 
measure statements in a behavioral objective: "on a test con- 
structed by the teacher," "on the XYZ standard mathematics 
achievement test," "by presenting a five minute extemporaneous 
talk to the class," "by taking ten minutes of dictation in shorthand 
and transcribing it to business letters," "by removing the spark- 
plugs, cleaning them, and replacing them in the car." 

Each of the above simply identifies the measure to be used to 
determine the attainment of the objective. Each stops short of 
making reference to a value judgment. None communicate to the 
teacher how he will separate satisfactory performance from that 
which is unsatisfactory. For example, consider this objective: 
"After six weeks in physical education, the students should be able 
to run one mile on level ground." The conditions are the "after six 
weeks" and "on level ground." The goal is to "run one mile" and in 
this case the goal also describes the criterion measure — namely the 
act of running one mile. But this particular physical education 
teacher has a small problem with the objective as it stands. Almost 
anyone can run a mile! Now some might run the mile in 528 
separate 10-foot "wind sprints," each of which is separated by a 
10-minute rest. Some may finish the mile in four minutes; others 
may require four days, but nearly everyone can run a mile. One
wonders if this is what the physical education teacher had in mind. 
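
The arithmetic behind the wind-sprint picture can be checked with a short sketch. The figures (10-foot sprints, 10-minute rests) come from the paragraph above; the function name and the assumption that the sprinting itself takes a negligible amount of time are added here only for illustration.

```python
FEET_PER_MILE = 5280


def wind_sprint_plan(sprint_feet=10, rest_minutes=10):
    """Return (number of sprints, total resting time in days) for one mile.

    Assumes the running time is negligible next to the rests, so the
    total time is approximated by the rests alone.
    """
    sprints = FEET_PER_MILE // sprint_feet
    rests = sprints - 1  # one rest between each pair of consecutive sprints
    rest_days = rests * rest_minutes / (60 * 24)
    return sprints, rest_days


sprints, rest_days = wind_sprint_plan()
print(sprints, round(rest_days, 2))  # 528 3.66
```

Under these assumptions the slowest runner takes a little under four days, which is consistent with the "four days" mentioned above and shows how much the objective leaves unsaid.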

The teacher probably was interested in getting the students into 
good physical condition, and would therefore be concerned that 
they could run the mile within some commonly accepted time 
standard. He might complete the objective with a final phrase: 
"within eight minutes." The "run a mile" is the criterion measure; 
the "within eight minutes" is the minimum acceptance level. 
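
One way to keep the four parts straight is to record them separately and ask whether a standard has been stated at all. The sketch below is only an illustration: the class name, field names, and `has_standard` method are hypothetical and are not part of any established tool; the wording of each part is taken from the mile-run example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BehavioralObjective:
    """Hypothetical container for the four parts discussed in this chapter."""
    goal: str
    criterion_measure: str
    conditions: List[str] = field(default_factory=list)
    minimum_acceptance_level: Optional[str] = None

    def has_standard(self) -> bool:
        """True when a minimum acceptance level has been stated."""
        return bool(self.minimum_acceptance_level)


mile_run = BehavioralObjective(
    goal="run one mile",
    criterion_measure="the act of running one mile",
    conditions=["after six weeks in physical education", "on level ground"],
)
print(mile_run.has_standard())  # False: nothing separates pass from fail

mile_run.minimum_acceptance_level = "within eight minutes"
print(mile_run.has_standard())  # True
```

The conditions list may legitimately be empty, but an objective without a minimum acceptance level gives the teacher no way to separate satisfactory from unsatisfactory performance.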



Suppose a minimum acceptance level is now linked to the examples
given earlier. In each case, the criterion measure is given in
regular print, and the minimum acceptance level is enclosed in
brackets:

. . . [by scoring 85 percent correct] on a test constructed by the
teacher.

. . . [by reaching at least the 50th percentile on the national
norms of] the XYZ standard mathematics achievement test.

. . . by presenting a five-minute extemporaneous talk to the
class [judged to be satisfactory by unanimous consent of three
speech teachers].

. . . by taking ten minutes of dictation in shorthand and
transcribing it to business letters [within one hour and with less
than three mistakes per page].

. . . by removing the sparkplugs, cleaning them, and replacing
them in the car [within one hour and so that the car runs when
the task is complete].

Now you try some examples. For the first five questions, copy the 
criterion measure on one line, and the minimum acceptance level 
on the next. 

1. ... by swimming two lengths of front crawl and two of back crawl
within 15 minutes, stopping not more than twice. 

2. ... by writing a paper comparing the plots of two novels by the 
same author to the satisfaction of the teacher. 

3. ... by showing knowledge of traffic laws and signs in a 60-minute 
driving experience where two or less errors are made. 

4. ... by sewing a piece of clothing requiring a 7-inch zipper such that 
an outside judge cannot distinguish student's zipper from an example 
prepared by the teacher. 

5. ... by solving 12 addition problems in less than 15 minutes, making 
no more than two mistakes. 



On the next five only the criterion measure has been given. You 
supply a minimum acceptance level which seems reasonable. 

1. ... by leading a 30-minute discussion on a social problem . . .

2. ... by taking a spelling test of 20 words . . . 

3. ... by typing a business letter . . . 



4. ... by writing a critique on the art film shown in class . . . 

5. ... by measuring unknown masses of approximately 5.00 grams
weight on two different balances . . . 



Here are the answers for the first five questions, and some sug- 
gested completions for the second five. 

1. CM: by swimming two lengths of front crawl and two of back crawl
   MAL: within 15 minutes, stopping not more than twice

2. CM: by writing a paper comparing the plots of two novels by the
   same author
   MAL: to the satisfaction of the teacher

3. CM: by showing knowledge of traffic laws and signs in a 60-minute
   driving experience
   MAL: where two or less errors are made

4. CM: by sewing a piece of clothing requiring a 7-inch zipper
   MAL: such that an outside judge cannot distinguish student's zipper
   from an example prepared by the teacher

5. CM: by solving 12 addition problems
   MAL: in less than 15 minutes, making no more than two mistakes

Some suggested completions: 

1. . . . where at least three problems are introduced; or where students 
do 80 percent of talking 

2. . . . with less than two errors 

3. . . . within 15 minutes; or with erasures 

4. . . . mentioning and contrasting it to at least three works of art in 
different time periods

5. . . . within 3 percent accuracy for each measurement 

A couple of cautions should be implanted in your mind at this 
point. First, the criterion measure should "fit" the goal. Second, 
the minimum acceptance level should be reasonable and defensible. 

Look at this objective and decide if you think the criterion 
measure "fits" the goal: "The student should learn to make raisin 
cookies and demonstrate this ability on a twenty-item multiple 
choice test." The goal is for the student to bake cookies (raisin, at 
that); but the measure is performance on a paper and pencil test.
The measure doesn't fit. Wouldn't it be better to have the student
demonstrate the achievement of the objective by actually baking 
the cookies? 

The example is silly, but equally silly and, unfortunately, all too 
real examples from the schools can be found. For example, can you 
think of times when a student's performance in gym is measured by 
a cognitive test; a student's ability to use the microscope in biology 
is measured by asking him to match the names of the parts with 
letters marked on a picture of a microscope; the "improvement in 
the faculty's instructional capabilities" is measured by the age of 
the faculty or the highest degrees held by most? 

The first cautionary note: Make sure the criterion measure you 
choose fits the goal. 

Next, the minimum acceptance level should be both reasonable 
and defensible. Choose the completion of each of the following three 
objectives which you think is most reasonable and defensible. 

1. Without references, the student should be able to name the parts of 
a flower: 

a. within 40 seconds. 

b. naming correctly 18 out of 20 parts. 

c. naming all 20 parts without error. 

2. The student should be able to type the letter provided by the teacher: 

a. without errors. 

b. with no more than three neatly corrected errors. 

c. to uniform darkness and deviating no more than 1/16 inch from
the margins. 

3. The fourth graders should know the addition, subtraction, and 
multiplication facts for the numerals 1 through 10: 

a. with at least 85 percent correct. 

b. with less than 2 percent wrong. 

c. without error. 



Unless the teacher had some strong and compelling reason to 
demand absolute mastery, it seems that response "b" is more 
reasonable and defensible than "c" in the first item. Establishing a 
time standard, as suggested by "a" seems completely out of har- 
mony with the task at hand. In the second question, the "b" 
response seems most appropriate. Demanding errorless typing 
would fail most of us, and uniform darkness and the amount of
deviance from margins are both really functions of the machine one
uses, rather than functions of the typist. Both are unreasonable and 
nondefensible. 

The last question deserves some comment. The topic is the num- 
ber facts from 1 to 10 for addition, subtraction, and multiplication. 
A frequently used minimum acceptance level is the 85 percent 
mastery mentioned in the "a" response to the item. In the caution 
currently under discussion, you have been asked to make the 
minimum acceptance level reasonable and defensible. Is it reason- 
able to ask 85 percent mastery of these facts? Is it reasonable to 
ask 98 percent mastery, as suggested by "b"? Or should you de- 
mand absolute mastery — 100 percent correct? 

These are fourth-grade students, remember. So much of what 
they do in fourth-grade arithmetic and later will depend on these 
facts. Adding columns of numbers, subtraction, multiplication of 
numbers greater than one digit, long division, computing interest, 
or doing "story problems" all depend on knowledge of the number 
facts. The 100 percent mastery specification seems both reasonable 
and defensible in this case. 

Thus the second cautionary note: The minimum acceptance level
must be both reasonable and defensible. A corollary to this: Make 
the degree of mastery fit the importance of the task. Don't fix your 
mind on one percentage figure and refuse to deviate from it. 
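
To see what each mastery figure actually permits, it can help to turn the percentage into a count of allowable errors. The sketch below assumes a hypothetical 100-item test of number facts; the test length and the function name are illustrative only, and the three percentages are the ones discussed above.

```python
def max_errors_allowed(n_items: int, required_percent: int) -> int:
    """Largest number of wrong answers that still meets the required percent."""
    # Ceiling division: the student needs at least this many items correct.
    min_correct = -(-n_items * required_percent // 100)
    return n_items - min_correct


for percent in (85, 98, 100):
    print(percent, max_errors_allowed(100, percent))
# 85 15
# 98 2
# 100 0
```

On such a test, the 85 percent level would pass a fourth grader who misses fifteen number facts, which is the kind of consequence to weigh when deciding whether a level is defensible for the task at hand.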

Assuming your problems with the presentation have been mini- 
mal to this point, you are now promoted out of basic training and 
assigned the status of intern in behavioral objectives. Before you 
are raised to the status of specialist in behavioral objectives it is 
time for you to review with some real patients. Divide each of the 
following statements into four component parts, as follows: 

a. Which part (if any) states the goal? 

b. Which part (if any) states the condition(s)? 

c. Which part (if any) describes the criterion measure? 

d. Which part (if any) describes the minimum acceptance 
level? 

1. Without books or notes the student shall demonstrate he has read 
Tom Sawyer by taking an in-class essay test and answering at least 
two out of four questions to the teacher's satisfaction. 

2. The student shall show he has mastered division skills by solving ten 
division problems at the blackboard with no more than one mistake. 

3. The student shall show he knows how to use the library by locating 
five references without outside help and locating correctly at least 
four of them. 



4. Using any notes, the student shall demonstrate understanding of the 
Constitution by taking the required state objective test and scoring 
70 percent or above correct. 

5. The student shall show knowledge of the ten different types of 
triangles by correctly identifying at least nine when shown the 
shapes on the blackboard. 



Here are the answers: 

1. Conditions: Without books or notes
   Goal: the student shall demonstrate he has read Tom Sawyer
   CM: by taking an in-class essay test
   MAL: and answering at least two out of four questions to the
   teacher's satisfaction.

2. Goal: The student shall show he has mastered division skills
   CM: by solving ten division problems
   Conditions: at the blackboard
   MAL: with no more than one mistake.

3. Goal: The student shall show he knows how to use the library
   CM: by locating five references
   Conditions: without outside help
   MAL: and locating correctly at least four of them.

4. Conditions: Using any notes
   Goal: the student shall demonstrate understanding of the Constitution
   CM: by taking the required state objective test
   MAL: and scoring 70 percent or above correct.

5. Goal: The student shall show knowledge of the ten different types
   of triangles
   CM: by correctly identifying
   MAL: at least nine
   CM: when shown the shapes
   Conditions: on the blackboard.



Many critics of the behavioral objective movement contend that 
this technique will result in an emphasis on low levels of learning 
(recall, recognition, comprehension) at the expense of higher order 
things (application, analysis, synthesis). Criticisms, you remember,
have been delayed for chapter 3; but the response to this criticism 
is important enough to demand inclusion as a cautionary note in 
this chapter. Behavioral objectives can span the entire range of 
cognitive, affective, and psychomotor objectives. While behavioral 
objectives at the recall and recognition levels are easiest to write, 
an example later in this chapter will show how objectives can be 
written for all cognitive levels. 

The problem might be more clearly understood if the notion of
"cognitive levels" were made more explicit. Fortunately a classification
system for these various levels has been developed.² The system,
or taxonomy, separates the cognitive area into six hierarchical
levels. The levels are hierarchical in that each higher level depends 
on or subsumes the prior attainment of any level below it. Although 
you are directed to the primary source for a complete description, 
briefly the six levels, beginning at the lowest and moving to the 
highest, are as follows: 

1. Knowledge. When you ask a student to indicate his knowl- 
edge, you are asking him to remember material which was learned 
previously. Objectives at the knowledge level can be recognized by 
the inclusion of verbs such as "recall," "recognize," "define," 
"match," "name," or "identify." In each case, the process involves 
a student first learning something, and then showing you, in some 
manner, that he has that knowledge. 

2. Comprehension. Usually you will want more than just knowl- 
edge; you look for understanding as well. When a student compre- 
hends an idea he should be able to do certain things with it. For 
example, he should be able to restate the idea in his own words; 
interpret it; or suggest how it will apply in a new or different 
situation. Objectives at the comprehension level can be recognized 
by the inclusion of phrases such as "give further examples of," 
"state in his own words," "explain," or "predict." Comprehension, 
then, is at the first level of understanding. 

3. Application. A large proportion of the classroom learning 
program is directed toward developing in the student usable skills, 
techniques, and attitudes. The teachers hope the student will learn 
to apply what he has learned to his own life. Remember, though,
that the levels of the taxonomy are hierarchical. A student cannot
apply information until he comprehends it; and he cannot comprehend
it until he knows it.

2. Benjamin S. Bloom, ed., Taxonomy of Educational Objectives,
Handbook I: Cognitive Domain (New York: David McKay, 1956).

Application level objectives are fairly easy to recognize. Solving 
problems, computing answers, and demonstrating principles are 
just a few examples. The application level is an important one, for 
within it the student applies previously learned information to 
new situations which, it seems, is a primary overall objective in 
education. 

4. Analysis. "Let's analyze this situation." You've heard that — it 
means "take it apart." When you ask a student to analyze a prob- 
lem, you expect him to break it into its component parts. Analyze 
a problem or situation. Break it down. Divide it up into its com- 
ponent parts . . . little pieces out of a big chunk. An analysis level 
objective demands that the student be able to take apart some 
problem, concept, or situation. 

5. Synthesis. The opposite of "take apart"? Put it together — 
synthesize. The events and pressures in the student's life are not 
independent of each other even when they appear to be. He must 
synthesize information from a wide variety of sources, and occa- 
sionally this information is contradictory. Whereas analysis re- 
quires the student to take a problem apart, synthesis requires 
pulling together a variety of ideas, viewpoints, problems, or con- 
cepts and placing them in a single framework. Ideally, a university's 
comprehensive examination at the masters or doctoral level should 
focus on the synthesis level, demanding that the student assemble 
all his various information inputs into one general framework. 
When you write synthesis objectives you probably will include 
words such as "categorize," "combine," "pull together," "organize," 
or "reorganize." 

6. Evaluation. To assess means to measure the amount of; to 
evaluate means to assess and to judge. Evaluation implies all of the
preceding levels: having knowledge, comprehending, applying, an- 
alyzing, and synthesizing. At the final level the student evaluates 
or makes a judgment about the concept at hand. Generally, when 
one judges it is in the context of "compared to" — it is on the basis 
of some set of criteria. You may, in some situations, provide the 
framework to the student. At other times his judgments will be 
made on the basis of criteria which he has established within 
himself. 

In writing behavioral objectives, avoid the error of continually
emphasizing the knowledge and comprehension categories to the
exclusion of the others unless, of course, the knowledge and
comprehension objectives are legitimately the primary emphasis of
your particular area of instruction. This, however, is rarely the
case. Most instructional programs will have objectives at nearly all
of the cognitive levels.
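
As a rough self-check on the balance of a set of objectives, the cue words quoted in the level descriptions above can be collected into a lookup. The sketch below is a crude aid under stated assumptions: the function and table names are hypothetical, the application verbs are inferred from "solving problems, computing answers, and demonstrating principles," and the analysis and evaluation levels are omitted because the descriptions above list no cue words for them.

```python
# Cue words taken from the level descriptions above; the application
# entries are inferred from the examples given there.
LEVEL_VERBS = {
    "knowledge": ["recall", "recognize", "define", "match", "name", "identify"],
    "comprehension": ["give further examples of", "state in his own words",
                      "explain", "predict"],
    "application": ["solve", "compute", "demonstrate"],
    "synthesis": ["categorize", "combine", "pull together", "organize",
                  "reorganize"],
}


def suggest_levels(objective: str) -> list:
    """Return the cognitive levels whose cue words appear in the objective."""
    text = objective.lower()
    return [level for level, verbs in LEVEL_VERBS.items()
            if any(verb in text for verb in verbs)]


print(suggest_levels(
    "The student should recall the names of the continents without error."))
# ['knowledge']
```

A lookup like this only suggests a starting point. A cue word can appear in an objective that belongs to another level or to another domain altogether (an objective to "demonstrate interest" is affective, not application), so the final placement remains a matter of judgment.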

In addition to the cognitive hierarchical levels, a taxonomy of 
affective objectives has been developed.³ The affective area includes
such things as interests, attitudes, feelings, emotional set, and 
appreciations. Just as in the cognitive case, the affective area has 
been divided into categories. Receiving is the first. Before you can 
have any effect on a student he must be willing to receive your 
attention. A student who is receiving is listening, or selecting for 
attention, or is paying attention. 

Responding is the second affective level. When a student re- 
sponds, he goes further than simply paying attention; he responds 
to or participates in this process. The student might respond be- 
cause he wants to or he might respond because (he thinks) he has 
to. After the responding level comes the valuing level. How much 
does he appreciate a certain concept or idea? What is his attitude 
toward it? 

A person has many values. Sometimes the values come into con- 
flict with each other. A person must do more than just develop 
values; he must make them internally consistent if a sense of 
personal stability is to result. The notion of organizing values into 
an internally consistent system is assumed under the heading of 
organization, which follows valuing in the affective taxonomy. 

The final category is characterization by a value or value com- 
plex. First the person develops values; then makes the values 
internally consistent; under the final category, the person's behav- 
ior (his life style) is under the control of this value system. If one's 
behavior is controlled by a stable and consistent system of values, 
the behavior tends to be both pervasive and relatively predictable. 
The cognitive domain centers on thinking tasks, and measures in
this area are primarily based on an achievement or aptitude test of
some kind. The affective domain centers on feelings — the students' 
attitudes, opinions, interests, and personality. The final domain is 
the psychomotor domain. The simplest kind of psychomotor activ- 
ity involves imitation ("How big is baby? Sooooo big!"). A student 
is expected to imitate the teacher's examples in the process of 
learning to copy letters and numbers. But the student must also
learn to write letters, words, and numbers without the teacher
providing an example. The student must learn to make physical
movements with some precision in activities ranging all the way
from handwriting to recreational games or typing. Objectives in the
psychomotor domain would include, "writes name legibly," "prints
letters of alphabet," "writes from left to right," or "is capable of
operating an automobile safely."

3. D. R. Krathwohl, B. S. Bloom, and B. B. Masia. Taxonomy of
Educational Objectives, Handbook II: Affective Domain (New York:
David McKay, 1964).

The goal of this section has been to caution you against specify- 
ing only lower level cognitive objectives at the expense of higher 
level cognitive objectives. Affective and psychomotor objectives 
should be included when appropriate. The descriptions of the cogni- 
tive and affective categories have been necessarily brief, and poten- 
tial users are urged to consult the primary sources for more detail. 

Before going on, though, try your hand at categorizing the fol- 
lowing objectives. Number your page 1 through 12 and make two 
columns — one headed "Type," the other, "Level" — for each num- 
ber. First, decide if the objective is cognitive, affective, or psycho- 
motor. For the cognitive and affective objectives, also decide in 
which of the hierarchical categories the objective belongs. 

EXAMPLE: The student should recall the names of the 
continents without error. 
TYPE: Cognitive LEVEL: Knowledge 

1. The student should be able to categorize and name the parts of an 
unfamiliar business letter. 

2. The student should be able to critique and appraise another stu- 
dent's German composition on the basis of its organization. 

3. The student should be able to solve a math problem which requires 
both addition and subtraction, although he has previously solved 
problems using only one at a time. 

4. The student should demonstrate his interest in the topic presented 
by the guest speaker by asking questions. 

5. The student shall paint a picture that demonstrates mastery of the 
principles of shading and perspective. 

6. The student should be able to give an example of a mathematical 
induction proof. 

7. The student should be able to carry out a lab experiment involving 
a new problem which is based on principles already familiar to him. 

8. The student will turn in his team project by the date it is due. 

9. The student should be able to identify the designated parts of the 
sewing machine. 



10. The student should integrate political, religious, economic, and 
social concerns into a discussion of modern education.

11. The student shall be able to play a 25-measure musical composition 
in a major key to the satisfaction of the teacher. 

12. The student should show in his lab write-up that he recognizes the 
need for structure in his physics lab. 



Ten out of twelve correct is a satisfactory score. Differences of 
opinion can exist, and the placement of an objective in two different 
categories can often be defended. The answers are as follows: 

     TYPE          LEVEL

 1.  Cognitive     Analysis
 2.  Cognitive     Evaluation
 3.  Cognitive     Application
 4.  Affective     Receiving
 5.  Psychomotor
 6.  Cognitive     Comprehension
 7.  Cognitive     Application
 8.  Affective     Responding
 9.  Cognitive     Knowledge
10.  Cognitive     Synthesis
11.  Psychomotor
12.  Affective     Organization

Frequently the most important objectives are the ones which are 
most difficult to translate into behavioral language. Although the 
task is difficult it is usually possible to be at least partially success- 
ful. Consider, for example, the two examples which follow. One is 
from a political science class, the other from English literature. 

Among the objectives which are important to a particular politi- 
cal science teacher is one dealing with the interpretation of political 
statements. This teacher believes a student should be able to read 
current information, interpret it, and make rational judgments 
about it. The objective is an important one. It is also very general 
and probably would be interpreted and evaluated in a wide variety 
of manners by an array of teachers. Here is how the general objec- 
tive might be translated into behavioral language: 



KNOWLEDGE LEVEL: The student should recall the names of 
five leading political theorists and recognize statements which 
describe the general philosophical position of each. (Note: The 
instructor would probably present this information to the student 
in a printed or lecture format.) 

COMPREHENSION LEVEL: The student should be able to state 
the general philosophical position of the five leaders in his own 
words to the satisfaction of the instructor. 

APPLICATION LEVEL: Given new and unfamiliar statements by 
each of the five theorists the student shall match the statement to 
the name of the theorist. 

SYNTHESIS LEVEL: (a) Given a political editorial from a news- 
paper, the student should be able to identify the theorist who would 
be most in agreement, and the theorist who would be in greatest 
disagreement. (b) The student should be able to compare the five
theorists, giving both similarities and differences in their philos- 
ophies. (Each of these would be measured by standards agreeable 
to the experts in the field, who, in this case, probably would be 
interpreted as the political science faculty at this school.) 

EVALUATION LEVEL: Given a political editorial from a news- 
paper, the student shall be able to identify in writing how the 
statement agrees and disagrees with his own political philosophy, 
where his own political philosophy is described in terms of the five 
theorists. 

Remember that the general objective was: "A student should be 
able to read current information, interpret it, and make rational 
judgments about it." Note how the behavioral statements make the 
meaning of this general statement unambiguous. Perhaps you do 
not agree with the interpretation. The fact that you now know 
how the general statement is going to be specifically interpreted is 
one of the strengths of doing things in behavioral language. When 
the statement was left in general terms, you did not have enough 
information to decide whether or not you were in agreement. 

The process of defining levels in the taxonomy is dangerous. For 
example, under the application heading the objective asks that the 
student read a new statement by each theorist and identify the 
theorist from the viewpoint. Now suppose the teacher is disreputable
and sneaks the information to the student, or the student is
interested enough in the topic to do extensive outside reading. In 
either case what was supposed to be new information is really 
familiar to the student. Therefore, the objective is no longer the 
application of a previously learned amount of information, but is
a knowledge level objective, since the student would simply be 
recognizing or recalling something previously learned. 

The second example is drawn from a literature class, presumably 
somewhere at the secondary level. The class is reading Huckleberry 
Finn. The teacher has this general goal: That the student under- 
stand the basic techniques and thematic implications of Huckle- 
berry Finn. In more specific language, some objectives might be: 

KNOWLEDGE: Name and discuss briefly five of the different 
people Huck pretends to be on various occasions. 

COMPREHENSION: Cite four examples of satire that Twain uses 
in Huckleberry Finn. 

APPLICATION: Judging from his behavior, statements, and the 
rationale he seems to employ, what would you hypothesize to be 
Huck's concept of God? 

ANALYSIS: Including Pap and Jim as specific examples, discuss 
the concept of the "Father image" and authority as they relate to 
Huck. Explain the relationship of this with the novel's overall 
themes. 

SYNTHESIS: Pretend that you are Huckleberry Finn and you 
have been asked to give a speech relating your ideas about hypocrisy
in humanity. Write an essay outlining your views.

EVALUATION: In light of contrasting ideas such as the responsi- 
bility of man to his society, and the responsibility of the individual 
to himself, argue for or against the style of life that Huck's philos- 
ophy would idealize. Use examples to support your case. 



Questions 

Choose a general objective from elementary school arithmetic. Trans- 
late it to behavioral terms at various cognitive levels as has been 
done in the two examples. 

Here is a very general objective from a college level educational 
psychology class: The student shall become sensitive to individual
differences in the elementary school classroom. Try to translate the 
general objective to behavioral language. 



In Support of Behavioral Objectives 

A learning program is designed to bring about some sort of change 
in one or more students. A reading specialist designs a program to 
help Johnny differentiate among b, d, and p; a biology teacher 
specifies activities so that students will learn to focus a microscope 
and make observations of unknown specimens; students are re- 
quired to study the Constitution of the United States because it is 
hoped that this will affect their behavior; and the federal govern- 
ment provides funds for a large reading program in a depressed 
area because possible benefits may accrue. 

You identify an educational problem. You come up with a 
possible solution to the problem. You set about trying out your 
solution with one or more students, or at least in some educational 
setting. If your goals are ambiguous, how will anyone ever know 
if your attempt was successful? 

For example, you might say: "I want to do the right thing for 
these kids." Great! Now, when your program is done you can look 
for all the positive effects you observe (ignoring any failures) and 
say: "That's what I was trying to do." That's called a "cop-out." 

The use of behavioral objectives is designed to improve commu- 
nications. If you identify and specify your objectives in behavioral 
language, three important things happen: 

1. Everyone interprets what you are trying to accomplish in 
approximately the same manner. The confusion described in the 
play in chapter 1 is avoided. 

2. Everyone has a chance to decide whether or not they agree 
with what you are trying to do. An outsider cannot disagree with 
what you are trying to do if he can't even understand what your 
efforts are. You open yourself up for constructive criticism, which 
seems like a perfectly reasonable thing for a professional to do. 

3. The success or failure of various parts of your program will 
become quite obvious to all who would look. You would not need to 
depend on testimonials or the predictions of authorities. Since a 
behaviorally stated outcome is in terms of observable behavior, one
needs only to see if the behavior occurs to determine if the program 
is successful. 

Consider the other side of the coin. Suppose you specify your 
outcomes in very vague and ambiguous language. 

1. Different outsiders will interpret your efforts in light of their 
own interpretation of the vague statement. Different people will be 
expecting you to be attempting to do a wide variety of things in the 
program. 

2. No one will really know specifically what it is you are trying 
to do. Thus, no one can really agree with you; and no one can 
really disagree with you. Whenever someone disagrees, you re- 
spond: "That's not what I was trying to do." 

3. Since no one really knew what you were trying to accomplish, 
the results of the program can be used to support or negate your 
original objectives, depending upon how the user happened to 
interpret your original statements. 

Behavioral objectives are a vehicle for improving communication. 
If progress is to be made, all parties — student, teacher, test writer, 
taxpayer — must have a clear, common interpretation of educational 
goals. 

The goal of this chapter has been to show you how to construct a 
behavioral objective, and how to evaluate a behavioral objective 
constructed by others. A behavioral statement states the goals, tells
how the attainment of the goal will be evaluated, and specifies the
conditions under which it will be evaluated. The behavioral statement
does not speak to the instructional strategy which will be involved. 
View the behavioral objectives movement as a technique for im- 
proving communications rather than as a new educational philos- 
ophy. Not everyone agrees that the movement toward behavioral 
objectives will have a salutary effect. They have their day in court 
next. 



Summary 

The goal of a behavioral objective is to make a communication 
specific and unambiguous. If a statement is specific, nearly every- 
one using it will have approximately the same interpretation of the 
meaning. It should tell what the student will be doing when the ob- 
jective is reached. The verb in the behavioral statement should 
describe an activity engaged in by the person you are observing.

A behavioral objective will describe the terminal or final behavior
of the students. Of course, a whole series of other behaviors may be 
required before the student can show the terminal behavior. The 
behavioral statement does not speak to instructional strategy, only 
to the terminal behavior desired. The statement should describe 
important conditions under which the learner will be asked to 
manifest the desired terminal behavior. The conditions should not 
be capriciously imposed. The teacher should have reasons for im- 
posing each of the conditions — why should one formula be mem- 
orized when books are allowed in other cases? 

The criterion measure indicates to what degree the student has 
obtained the desired objective. Usually absolute perfection is not 
required in a task, so most behavioral objectives contain a minimum 
acceptance level. Once again, the minimum acceptance level must 
be reasonable and fit the situation. Sometimes perfection should 
be required; at other times 50 percent accuracy may be good 
enough. 
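
The parts of a behavioral statement summarized above (terminal behavior, conditions, and minimum acceptance level) can be pictured as a small record. The Python sketch below is only an illustration under stated assumptions: the class name, the field names, the sample objective, and the 80 percent level are all hypothetical and do not come from the text.

```python
from dataclasses import dataclass


@dataclass
class BehavioralObjective:
    """Hypothetical record of one behaviorally stated objective."""
    terminal_behavior: str      # what the student will be observed doing
    conditions: str             # circumstances under which it is measured
    minimum_acceptance: float   # required proportion correct, 0.0 to 1.0

    def is_met(self, correct: int, attempted: int) -> bool:
        """Return True when the observed proportion reaches the minimum."""
        if attempted <= 0:
            raise ValueError("attempted must be positive")
        return correct / attempted >= self.minimum_acceptance


# Illustrative objective; the 80 percent level is an assumed value.
addition_facts = BehavioralObjective(
    terminal_behavior="writes the sums for single-digit addition facts",
    conditions="without reference materials, on a written test",
    minimum_acceptance=0.80,
)

print(addition_facts.is_met(correct=17, attempted=20))  # 17 of 20 is 85 percent
print(addition_facts.is_met(correct=15, attempted=20))  # 15 of 20 is 75 percent
```

Note that the record says nothing about instructional strategy; it holds only the desired terminal behavior, the conditions, and the criterion, which is the point made in the summary.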

Behavioral objectives are not limited to the lower level cognitive 
types of recall and recognition questions. Objectives can be written 
in behavioral language at all six cognitive levels — knowledge, 
comprehension, application, analysis, synthesis, and evaluation. Be- 
havioral objectives can also be written in the affective and psycho- 
motor areas. 



Behavioral Objectives: 
The Other Side of the Coin 



The author has spoken on the topic of behavioral objectives to a 
wide variety of audiences. Undergraduates and graduate students, 
teachers in the field, school administrators, and people in business 
and industry all seem interested in learning about this technique. 
Invariably, questions arise following a talk — questions which reflect 
certain fears about some possibly unhappy consequences of the 
widespread use of behavioral objectives. It is of interest that the 
doubts expressed by students usually revolve around the possi- 
ble unfairness of the technique with certain people; the doubts of 
in-service teachers and administrators center on matters of prac- 
ticality; and the doubts of industrial managers focus on cost- 
effectiveness. 

Fairness 

Suppose the issue of fairness is considered first. In chapter 1 a 
farce was presented. In this chapter another story will be used — 
suppose we label it a morality story. The story starts when the six 
speech teachers at XYZ High School are required by the Superin- 
tendent to gather together and translate the general objectives for 
the speech class into behavioral language. 

Meetings, meetings, meetings! The process was taking much 
longer than anticipated. All were growing restless with the 
task. The Superintendent had issued a mandate that each
department present to him a list of general objectives for its 
instructional program, translated into behavioral language. 

A series of general statements had been completed in the 
earlier meetings. The only remaining task was to translate each 
into behavioral language. The translation part of the task was 
moving along satisfactorily. 

Teacher A: We're getting near the end now. 

Teacher B: Outstanding! I suggest we all adjourn for pizza and 
beer when we're done. 

Teacher A: Agreed. This general objective was presented by 
Teacher C: "The student shall demonstrate poise 
and self-confidence while delivering a five-minute
extemporaneous speech." What do you mean by 
"poise and self-confidence"? (Question directed at 
Teacher C.) 

Teacher C: Well, you know . . . looks comfortable and not 
nervous . . . sort of relaxed and poised . . . self-con- 
fident . . . you know. 

(Note to Reader: Behaviorally speaking, these statements just 
aren't making it!) 

Teacher A: I know — but what behaviors would a "poised and 
self-confident" speaker have? What would a "poised 
and self-confident" speaker be doing? How do we 
differentiate his behavior from a not "poised and 
self-confident" speaker? 

Teacher D: (Thinking of the beer and pizza and trying to move 
the task along quickly.) One thing we might expect 
the student to do is make eye contact with the 
audience. One mark of self-confidence is eye contact. 

Teacher A: Now we're getting somewhere. Just one eye contact 
in a five-minute speech? 

Teacher D: No — much more. Let's say that the speaker must be 
making eye contact at least half the time, and that 
the eye contact must be with more than five different 
people. 

Teacher A: (Writing on the blackboard the general objective and 
the first translation of the general objective to be- 
havioral language beneath it.) "One manifestation of 
poise and self-confidence will be shown through eye
contact, where the student will be expected to be 
making eye contact with at least five different people 
during the talk. This eye contact will be manifest at 
least half of the time, as measured by the teacher." 

Teacher E: The sweep of Milton! Brilliant! 

Teacher C: (Trying to make up for the earlier inability to focus 
on a behavior.) A poised speaker doesn't stutter and 
stammer — doesn't have long embarrassing pauses — 
doesn't say "uh" very often — you know — or reuse the
same words all the time. 

(You know — like people who say "you know" all the time!) 

Teacher A: (Translating.) You say "doesn't stutter or stammer."
How about: "The student should not have more than
five interruptions during the five-minute talk, where
an interruption is defined as the use of 'uh' or 'you
know' or other common words or phrases or pauses
in excess of ten seconds."

Teacher B: That sounds O.K. 

Teacher A: Any others? (Long pause.) All right. We're agreed on 
the behavioral translation for this one. Let's move 
on to . . . 

We cut away now. The list of behavioral translations was 
approved by the Superintendent and the school board and 
became the LAW of the school. In the year which followed, the 
presence of these behavioral statements resulted in some inter- 
esting ramifications. Consider these two: 

Ramification 1: Teacher G just completed his B.A. and 
M.A. at nearby New Age University. Teacher G believes it is 
detrimental in the long run to force the student to conform to 
any particular speaking style. He is convinced that speaking 
style is a natural outgrowth of the student's overall personal- 
ity, and any externally imposed changes will only lead to con- 
fusion within the student. The result of this confusion will be 
that the student's ability to effectively communicate informa- 
tion orally will be diminished. The communication, he reasons, 
is the important final outcome; the means and technique used 
to reach this outcome are immaterial. 

Teacher G proceeds to administer to his five classes of 
sophomore and junior speaking neophytes. If a student feels
like shuffling when he speaks, Mr. G lets him shuffle. If the 
student feels most comfortable looking out the window, pacing 
the floor, laughing, punctuating a talk with cuss words, or 
using any sort of unusual delivery style, Mr. G encourages him 
to do so. He simply determines if the observed behavior is the 
one in which the student is most comfortable. If the student 
responds in the affirmative, Mr. G encourages the student to 
proceed in that manner. 

Everything seemed to be going along just fine. The students 
liked Mr. G and Mr. G liked the students. The day of reckon- 
ing arrived in May. At that time, Mr. A, who doubles as de- 
partment chairman, paid an unscheduled visit to Mr. G's class. 
He asked for some five-minute extemporaneous talks by the 
students. As they responded, he kept score. (How much time 
in eye contact? With how many different people did the 
speaker make eye contact? How many uhs and you knows and 
long pauses during the speech?) Not surprisingly, since none of 
these things had ever been mentioned in Mr. G's classes, his 
students didn't look good when measured against the depart- 
ment criteria. 

Final word: Mr. G was in big trouble — and untenured at 
that. 
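
The score Mr. A kept can be made concrete. The Python sketch below is a hedged illustration of the two department criteria as they were written on the blackboard; the class, the function, the field names, and both sample tallies are hypothetical. The dialogue says "more than five" people while the blackboard version says "at least five"; the sketch assumes the blackboard wording.

```python
from dataclasses import dataclass


@dataclass
class SpeechObservation:
    """Hypothetical tally kept by an observer during one talk."""
    talk_seconds: float          # total length of the talk
    eye_contact_seconds: float   # time spent in eye contact
    people_contacted: int        # different listeners looked at
    filler_count: int            # uses of "uh", "you know", and the like
    pause_lengths: list[float]   # length of each pause, in seconds


def meets_department_criteria(obs: SpeechObservation) -> bool:
    """Apply the two criteria from the story to one observed talk."""
    # Eye contact at least half the time, with at least five people.
    eye_ok = (obs.eye_contact_seconds >= 0.5 * obs.talk_seconds
              and obs.people_contacted >= 5)
    # An interruption is a filler or a pause in excess of ten seconds.
    long_pauses = sum(1 for p in obs.pause_lengths if p > 10)
    interruptions = obs.filler_count + long_pauses
    return eye_ok and interruptions <= 5


# Invented tallies for a five-minute (300-second) talk.
window_gazer = SpeechObservation(300, 40, 6, 0, [12, 15, 11, 14, 13, 12])
steady_speaker = SpeechObservation(300, 180, 7, 3, [4, 12])

print(meets_department_criteria(window_gazer))    # False: too little eye contact
print(meets_department_criteria(steady_speaker))  # True: 4 interruptions in all
```

Nothing in such a tally records whether the audience understood the talk, which is the outcome Mr. G considered important.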



Questions 

1. Some would say the imposition of these standards (the eye contact 
and interruptions) was unfair to Mr. G. Others would argue that 
Mr. G was unfair to his students by not informing them of what the 
other speech experts felt were important measures of a poised and 
self-confident speaking presentation. Take a stand. Which do you 
support? 

2. Translate the problem from high school speech to elementary school 
math. Here departmental objectives would deal with things like the 
addition of columns of numbers, long division, and the subtraction 
of fractions of unlike denominators. If you felt the department was 
somewhat arbitrary in not allowing Mr. G to ignore the speech ob- 
jectives, would you also agree that an elementary school teacher 
could ignore these math objectives? 

Ramification 2: Louis was a big, clumsy sophomore. He 
lived in a small house near the downtown area with his parents,
three older brothers and two younger sisters. Louis didn't talk 
too much, but he listened well. He was very popular with his 
classmates, who chose him to be a class officer. His grades were 
excellent, especially in math and science. Judged against most 
rational criteria, his speaking style was atrocious. Louis was 
assigned to Mr. A's section of speech. 

When Louis gave speeches he had a way of shuffling around 
the front, looking at the floor most of the time. When he made 
a point, he would look up for just a moment to see if the others 
seemed to follow him, then look back to the floor. He spoke in 
complete and precise sentences, but paused for long periods 
of time in between to make sure they were correct. His sense of 
humor always came through. 

The students in the class always listened attentively when 
Louis spoke. 

Mr. A thought Louis was unwilling or unable to learn, as 
well as arrogant and aloof in refusing the professional guidance 
offered to him. 



Questions 

3. Some would say that the imposition of these standards was unfair to 
Louis. Others would argue that Mr. A was simply doing his job; 
namely, requiring Louis to conform to the standards set by a com- 
mittee of speech experts. Who do you tend to agree with? Why? 

4. Translate the problem from high school speech to elementary school 
math — the same situation posed in question 2 earlier. Rather than
having a nonconforming speaking style, Louis develops some rather
unique (and incorrect) techniques for handling the addition of numbers,
long division, and subtraction of fractions. Is it unfair for the
teacher to evaluate Louis's performance on the basis of departmental 
standards in this situation? 

5. Each of the two pairs of questions (1, 2 and 3, 4) attempts to make 
a general point. State the general idea expressed by each. 



Any conversation about the "fairness" of behavioral objectives 
must consider both teacher and learner. Is it fair to the learner 
when objectives are not made specific? How many college teachers 
hide behind inconsequential performance indicators like classroom
attendance, memorization of trivia, and vague statements of gen- 
eralities? Wouldn't the student be better off if the instructor were 
asked to specifically define and defend his objectives? On the other 
hand, is it fair when he does specifically define his objectives and 
the student does not agree? Then the student is being evaluated on 
the basis of the objectives with which he disagrees — and that isn't 
fair either! 

Was it fair to Mr. G and to Louis when they were evaluated on 
the basis of an objective with which they did not agree? On the 
other hand, would it be any more fair to Mr. G if the department 
had not made its objectives specific? He would still be evaluated 
by the department head, but this time he would not have known in 
advance the basis upon which the evaluation was to be made. 

Nothing seems to be fair. Maybe the only solution is to close the 
schools. 

Like most other complex problems, the solution to this one is 
complicated. The fairness of specifying the objectives behaviorally 
depends on the situation at hand. The author agrees with Mr. G 
that a variety of speaking techniques exists, all of which lead to 
effective communication. Attempting to force either teacher or stu- 
dent to adopt a single set of criteria is fair to neither. The comment 
does not generalize, however. The criticism is tied to this one situa- 
tion. That is, in our opinion, there is no single, universally accepted 
speaking style for effective communication. The imposition of pre- 
cise criteria where the criteria quite possibly may be incorrect, or at 
least incomplete, seems unfair. 

However, at each level of schooling, from pre-school through 
graduate or professional school, some number of hoped-for out- 
comes which would be commonly accepted as desirable by educa- 
tion's constituency must exist. Ability to discriminate among small 
and capital letters; knowledge of the addition facts; ability to read 
a newspaper article; learning to add fractions; basic principles of 
our political system; ability to fill out a tax form or a job application;
and knowledge of the weights and measures used in this
country — these are just a few of the objectives which are commonly 
accepted as desirable by education's constituency. To ask both 
student and teacher to conform to these seems fair. 

Caution is the key word. In any situation where there exists 
reasonable doubt about the accuracy or completeness of the trans- 
lation of a general objective to behavioral language, room for 
variation should be allowed. The speech department members 
might have specified those criteria which they considered to be
attributes of a good speech, but left room for, and encouraged, 
experimentation with other techniques. 

The speech department imposed certain criteria. Mr. G dis- 
agreed. Had the department not translated "poised and self-con- 
fident" to behavioral language, the disagreement might have 
occurred anyhow. However, with the translation of objectives to 
behavioral language, at least it could be specifically determined 
why they disagreed. Once the issue is brought out into the open, 
dialogue can be initiated. People may differ on the specific interpre- 
tation of a general statement; but not until each group clearly 
understands what the other is saying can a meaningful dialogue 
ensue. 



Question 

6. List five behavioral outcomes for a specific course which you think 
would be commonly accepted as desirable by nearly everyone. Com- 
pare your list to those of others who are familiar with the course. Do 
you have any honest disagreements? 



Practicality 

Another group of criticisms of the developing behavioral objectives 
movement revolves about matters of practicality and the possible 
deleterious effects of behavioral objectives. Consider first the notion
of practicality — or more precisely impracticality. 

Many writers will tell the prospective teacher that good instruc- 
tion must be preceded by the specification of the objectives of that 
instruction. In addition, these specifications should be in behavioral 
language. At the same time the teacher-in-training is exhorted to 
treat each student as an individual. Presumably, the teacher is to 
state specific behavioral objectives for each individual's instruc- 
tional program. Presumably also, this task is to be done in each 
area of the instructional program — reading, math, science, social 
studies, and whatever else the particular school happens to stress. 

With thirty little (or big) individuals in the room, it can't be 
done. Not without a lot of help, anyhow. 

Here's an example to illustrate both the amount of time this 
process takes and the kind of help needed by the teacher. The
author recently completed writing three combination workbook and 
test books to accompany a basic elementary school science series.¹
The task of translating the instructional program into behavioral 
objectives, devising some sort of instructional strategy to reach the 
objectives, and writing appropriate practice exercises and test items 
for the objectives required approximately twenty to twenty-five 
hours per week over a one-year period. The author admits to a 
certain amount of sluggishness, but even one-half or one-fourth the 
expectation of that degree of time commitment on the part of a 
classroom teacher is totally unreasonable. After all, the classroom 
teacher also has reading, word skills, math, social studies, and 
classroom management decisions with which to deal. 
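
To put the time estimate in yearly terms, the arithmetic can be sketched as follows. This is a rough illustration only: the figure of 52 working weeks is an assumption (the text says only "a one-year period"), and the function name is hypothetical.

```python
WEEKS_PER_YEAR = 52  # assumption: the text says only "a one-year period"


def yearly_hours(hours_per_week: float, fraction: float = 1.0) -> float:
    """Total hours in a year at a given weekly load, scaled by a fraction."""
    return hours_per_week * WEEKS_PER_YEAR * fraction


# The author's reported load was twenty to twenty-five hours per week.
for fraction in (1.0, 0.5, 0.25):
    low = yearly_hours(20, fraction)
    high = yearly_hours(25, fraction)
    print(f"{fraction:>4}: {low:.0f} to {high:.0f} hours")
```

Under that assumption the full commitment is 1040 to 1300 hours, and even one-fourth of it comes to roughly 260 to 325 hours a year for a single subject area.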

The science program described also illustrates the kind of help 
which must be provided if the individual teacher is to follow the 
well-meant directions of their professors. Asking the teacher to 
provide her own behavioral objectives and consequent individual- 
ized instructional strategies is impractical; but asking the teacher 
to individualize instruction when the objectives, strategies, practice 
exercises, and testing program are all provided is not unreasonable. 

To state that it is impractical to ask teachers to specify objec- 
tives in behavioral language implies that, given adequate time, the 
task could be completely accomplished. For two general reasons, 
this is not so. For one thing, although some objectives can be 
specified, the outcomes defy measurement. For example, a high 
school math teacher deals with loans, interest types, and per- 
centages. The objective states that the students will, later in life, 
use this knowledge to obtain the best possible loan. The outcome 
could be measured; but the difficulty would be legendary. Here's 
another example: The members of the social studies faculty prob- 
ably could come up with a long list of behaviorally stated outcomes 
which, in their opinion, exemplify good citizenship. But the mea- 
surement of most of these (votes, participates in community affairs, 
keeps abreast of current events, and respects the rights of others, 
to name a few) would also be troublesome. 

Not only is it difficult to measure many legitimate outcomes 
which can be behaviorally stated, it is sometimes difficult to translate
a real objective into a defensible behavioral statement. Appreciation
of good literature and good music are legitimate goals of
both high school and college, as are developing good citizenship and
appreciation of the rights of others. Such general goals are even
more difficult to translate into a commonly accepted slate of behavioral
statements than was the "poised and self-confident" goal in
the speech class.

1. J. G. Navarra, J. Zafforoni, and J. W. Wick, Workbook and Achievement
Tests for The Young Scientist: His Experiments and Hypotheses, The Young
Scientist: His Predictions and Tests, and The Young Scientist: His Problems
and Methods (New York: Harper & Row, 1971).

The critics argue that the behavioral translations are impractical 
from three viewpoints. First, the task is too difficult for any given 
teacher. Second, many objectives can be stated behaviorally, but
the evaluation of the objective is very difficult. Third, many ob- 
jectives simply cannot be changed to a commonly accepted set of 
behavioral statements. 



Question 

7. In your opinion, do any (or all) of the three arguments have merit? 
Which is the most important? 



Cautionary Notes 

Another array of doubts expressed by in-service educators can be 
summarized under the general heading "behavioral objectives will 
be bad if they are used badly." The issues raised should be viewed 
more as cautionary notes than as criticisms. That is, the critics are 
saying "these things could happen if the instructional program 
becomes centered around the use of behavioral objectives." The 
response is not necessarily to cease using behavioral objectives; 
rather it is to avoid the predicted dire outcomes. 

One of these cautions centers on the opportunistic nature of the 
teaching process. If the teaching act involved only the dissemina- 
tion of information, skills, and techniques in a carefully planned for 
and prescribed manner, then the personality and individual char- 
acteristics of the teacher would not be as important as they are 
known to be. A good teacher seizes the conditions of the moment — 
conditions and events which cannot be anticipated. Plans and 
techniques which have worked with one group may function poorly 
with another, and the sensitive teacher makes in-flight decisions to 
change directions and alter plans. The opportunistic teacher knows
when to use humor and when to squelch it; when to input a diver- 
sion into a planned program; when to build on an unscheduled 
event; and when to draw on the special abilities found in the room. 
Some fear that the movement toward unambiguous specification of 
objectives will diminish this opportunistic dimension in the classroom.
The notion is that teachers will become so tied to the
lists of behavioral objectives that they will refuse to deviate from
them to "seize the moment."

Closely related is the point that much of the learning which 
occurs in the classroom is serendipitous in nature. That pretty- 
sounding word means that something unplanned but pleasant 
happens as you attempt to fulfill another goal. Suppose, for exam- 
ple, you are driving across town to do your laundry. On the way, 
you see and meet an old friend. The major goal was the laundry. 
The serendipitous event was meeting the friend. Much serendip- 
itous learning occurs in the classroom. Students learn far more than 
that which appears in the curriculum guide. 

A good teacher is opportunistic. Much serendipitous learning 
does occur in the typical classroom. It would be an unhappy out- 
come if the occurrence of these two in the classroom was seriously 
threatened by the use of behavioral objectives. Thus, the note of 
caution: Don't let this happen. Teachers must not be given the 
idea that only that which is specified behaviorally can be part of 
the events of the classroom. 

Pursuing that point one step further, suppose a teacher does 
seize an opportunity to introduce a new concept or idea, and later 
feels that the effort was worthwhile. Nothing in the behavioral 
objective idea prevents the teacher from stating a behavioral objec- 
tive after the event has occurred. If the teacher is opportunistic and 
comes upon a learning technique which leads to valuable outcomes, 
she will probably want to repeat the scene with later students. If 
the outcome is important it should be part of each year's program. 

You learned to write behavioral objectives in chapter 2, and are 
therefore in a position to recognize that the knowledge and compre- 
hension level objectives in the cognitive domain are the easiest to 
write. As you move to the higher cognitive levels (analysis, synthe- 
sis, evaluation) and into the affective domain, the objectives don't 
flow from the pen with such great ease. This leads to another point 
of caution raised by critics of the behavioral objectives movement: 
"Since the lower level objectives are easiest to write, the widespread 
use of behavioral objectives in instructional programs will lead us
toward a concentration on trivia." This cautionary note also falls 
under the heading: "Behavioral objectives will be bad if used 
badly." The fear is that concentration on behaviorally stated objec- 
tives will lead education toward even more "knee jerk" learning. 

You know, the physician taps your knee just right and your leg 
jerks. Something inside there is working properly. Students at all 
levels are taught to make "knee jerk" responses. The author once 
visited a ninth-grade algebra class in which the students were 
learning about formal proofs. The second line of every student proof 
was "given the domain of real numbers" — every student included 
that statement in every proof. The author asked the teacher if the 
concept of "unreal" numbers had been presented to the pupils 
(more precisely, an "unreal" number is called an imaginary number, 
usually symbolized by i = √−1). She responded in the negative.
The line in the proof had no meaning to the students. The "tap on 
the knee" was the second line of the proof; the "knee jerk" was the 
writing of "given the domain of real numbers." 

These things — concentration on trivia, concentration on "knee 
jerk" responses — do not need to happen. A detrimental effect is 
most likely to occur in cases where the school's decision makers 
decree that instructional programs should henceforth center on 
behavioral objectives without the absolutely necessary prior step 
of teaching the staff how to use behavioral objectives and caution- 
ing them about the misuses. Too often school administrators have a 
way of jumping on bandwagons without first finding out who is 
driving or where the vehicle is headed. Because the technique of 
stating outcomes behaviorally exists does not mean that this tech- 
nique must be used badly. 

Earlier in this chapter you were asked to contrast a high school 
speech class with an elementary school math program. Hopefully, 
this comparison of two different instructional settings suggested to 
you another cautionary note regarding behavioral objectives. The 
theme goes something like this: Even suggesting that goals can be 
translated to behavioral language equally well in all instructional 
areas is potentially harmful. Clearly the goals in literature and 
music appreciation are more difficult to state behaviorally than are 
those in mathematics or physics. If the people in each area are 
forced to be equally specific, some of the unhappy results expressed 
earlier (concentration on trivia, avoidance of opportunistic instruc- 
tion, for example) will occur. 



The solution to the problem seems fairly clear. Accept the note 
of caution as a legitimate one. Admit that the ability to specify 
objectives behaviorally does vary with the subject area and level. 
Each group should, however, at least try to translate general objec- 
tives to behavioral language, and for two good reasons. First, even 
the speech and literature faculties will be surprised at the large 
number of general objectives which can be stated behaviorally. 
Second, attempts to state general objectives behaviorally will bring 
out into the open areas where people do not agree on the interpre- 
tation of a general statement. As mentioned earlier, only when 
issues are brought into the open can meaningful dialogue occur. All 
members of the speech faculty do not have to agree on the meaning 
of a "poised and self-confident" speaker. After discussing the issue, 
they can simply agree to disagree, and leave the resolution of the 
issue up to further study and research. The further study and re- 
search probably would never happen if the issue had not been 
clearly defined. 

Behavioral objectives are geared to what the student is doing. 
"What can we observe?" the writer asks. What one observes is one 
thing. How the information is processed by the mind is another. 
These stories should illustrate the point. 

A man walks into a tavern with an attractive woman. The two 
are obviously more than friends. The man's gaze never leaves the 
woman. If he had looked around he would have seen four other men 
in the tavern. Each of these men knows the man who has just 
entered. Each knows that the man is married. Each knows the 
attractive woman is not his wife. Suppose we read the thoughts of 
the four observers. 

Man 1 smiles inwardly as he thinks: Don't they look happy! I'm 
glad my friend has found happiness. A man deserves to search for 
happiness and is fortunate when he finds it. 

Man 2 is less positive: This is terrible! He should either make a 
break with his wife or not do things like this. I'll speak to him when 
I get a chance. I'm sure this ends our friendship. 

Man 3 looks and looks away. His eyes firmly riveted on the TV, 
he wipes his mind clear of the incident. He thinks: It's none of my 
business. I will not think of the incident again. 

Man 4 looks and thinks: I wonder what happened. They (the 
man and his wife) seemed so happy last time we were together. 
Has something happened to her? I'll have to make some inquiries. 



All four observed the same event. The interpretations were dif- 
ferent, and so were the actions that the four would take after 
processing the information in their minds.

Ten-year-old Clint is asked to read aloud by his teacher. The gait 
is slow and uneven, a few words are mispronounced, and no expres- 
sion sneaks into the presentation — even at the end of a sentence. 
Four teachers observe and think. 

Teacher 1 thinks: I know this class well, and Clint is one of the 
poorest oral readers. 

Teacher 2 thinks: I know this class well and Clint is not as bad 
as some. Many read more poorly than he does. 

(Teachers 1 and 2 are norm-based evaluators. Each believes in 
inter-student measures.)

Teacher 3 thinks: Clint reads well enough to get by in this world. 
That's really the goal, you know. 

(Teacher 3 is a criterion-based evaluator. The performance of 
others is immaterial; it is comparison to some standard which 
counts.) 

Teacher 4 thinks: My, hasn't he made improvement since the 
last time I heard him. I should commend him on the progress. 

(Teacher 4 believes in evaluating progress. This teacher is an 
intra-student evaluator.)

In each of the two stories, four observers saw precisely the same 
event. They processed the information in different manners, and 
the decisions about any action dictated by the information were 
variable. 
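
The reading story can also be put in a more mechanical form. The following is only a sketch in Python, and every number in it is an assumption made up for illustration (the class scores, the criterion of 60, and Clint's earlier score of 45 do not come from any real class). It passes one observed performance through three different decision rules.

```python
# Hypothetical oral-reading scores (say, words read correctly per minute).
# All values are invented for illustration only.
class_scores = {"Clint": 62, "Ann": 85, "Ben": 48, "Cara": 90, "Dev": 55}
clint_previous = 45   # assumed score the last time Clint was heard
criterion = 60        # assumed "good enough to get by" standard


def norm_based(name, scores):
    """Inter-student view: compare one student with the rest of the class."""
    below = sum(
        1 for other, s in scores.items() if other != name and s < scores[name]
    )
    return f"{name} outperforms {below} of {len(scores) - 1} classmates"


def criterion_based(score, standard):
    """Compare with a fixed standard; other students are immaterial."""
    return "meets the standard" if score >= standard else "below the standard"


def intra_student(score, previous):
    """Within-student view: change since the last observation."""
    return f"changed by {score - previous:+d} since last time"


print(norm_based("Clint", class_scores))
print(criterion_based(class_scores["Clint"], criterion))
print(intra_student(class_scores["Clint"], clint_previous))
```

The three functions receive the same observed score, yet each produces a different kind of statement about it, which is the point of the story.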

"It is not the behavior which is of paramount importance," 
argues one group of critics, "it is the decisions one makes based on 
the observed behavior. Behavioral objectives," they continue, "will 
lead teachers to concentrate so hard on the behaviors that any 
progress we have made toward individualized instruction (that is, 
individualized decision making in the instructional program) will 
be lost." 

The unhappy outcome prophesied by these critics is not difficult 
to imagine. It must be avoided. A given set of complete objectives 
in elementary school mathematics could be handled by Student A 
easily within two years, whereas Student B might complete only 80 
percent in four years — and then with considerable difficulty. If the 
teacher feels that her responsibility to Student A has ended when 
he has fulfilled the objectives, then instruction is a long way from 
individualized. Not all students are likely to complete the same set 
or number of objectives. And identical observed behaviors on the
part of different students may not always lead to identical teacher 
decisions about those students. Elsewhere the author has argued 
that the decision-making process should stress intra-student 
(within student) change, rather than inter-student (between student)
measurements.²

In summary, here are the cautions which seem to be most fre- 
quently sounded: 

Behavioral translations of general goals may be unfair to teachers 
who do not agree with the translations (and may be evaluated on 
the basis of them). 

Likewise, behavioral translations may not be fair to students who 
do not agree with the translations (but may be evaluated on the 
basis of them). 

Concentration on behavioral objectives may lead the teacher 
away from being opportunistic in the classroom, and may eliminate 
much serendipitous learning. 

Concentration on behavioral objectives will lead educators to 
concentrate on lower level outcomes and "knee jerk" responses at 
the expense of higher levels of cognition. 

Some subject matter areas are more amenable to behavioral 
translations than are others, and to demand equal specificity in all 
would be unfair and possibly quite detrimental. 

The whole idea is impractical and no single teacher can possibly 
do a major part of her teaching on the basis of behavioral objec- 
tives. 

Behavioral objectives stress performance when it is the inference 
from that performance which is important, and not so much the 
performance itself. 

A total outcome is greater than the sum of the single parts which 
were observed, knowledge is not as certain and absolute as behav- 
ioral objectives advocates imply, and knowledge is relative and 
personal. 



A Class Debate 

8. Organize a class debate around the topic of behavioral objectives.
Choose three people for the "for" side, and three for the "against."
Allow them at least a day for preparation. Each speaker is allowed
one opening statement (about three minutes) and one rebuttal after
the other side has spoken (allow about two minutes per rebuttal).
At the end of the presentations, place an extra chair with each team,
and allow anyone in the room to join a team temporarily to make a
particular point. At the end of the hour, have all those who were not
on the two teams decide by secret ballot which team did a better job
of presenting its case.

2. John W. Wick and Donald L. Beggs, Evaluation for Decision Making
in the Schools (Boston: Houghton Mifflin, 1971).



Classroom Tests: 
A Survey of the Terrain 



Most teacher evaluation is not done with published tests. Some is 
done with tests written by the classroom teacher for use with one 
particular group of students. If you teach now, or intend to teach 
in the future, you will undoubtedly write some tests. This chapter 
and the two which follow are intended to guide you along the road 
to sensible classroom evaluation. The author will try to avoid the 
common trap of asking you to follow procedures that the average 
classroom teacher, for one reason or another, cannot or will not do. 

In this chapter the important elements of classroom evaluation 
are brought together into one conceptual scheme. The average 
classroom teacher rarely considers classroom evaluation in general. 
Instead, classroom evaluation is approached from the context of 
solving a particular and real evaluation problem — ranging all the 
way from making an instantaneous decision about which of two 
hand-waving students should be called on for the answer to a 
question to preparing a year-end test in arithmetic. Once you have 
become familiar with the general scheme, decisions regarding 
specific evaluation problems should be easier to make. Each par- 
ticular problem can be seen as part of a broader picture. The 
broader picture is the subject of this chapter. 

In chapter 5, a variety of item types will be presented to arm you
with some of the weaponry of the trade. In chapter 6, examples of 
the most common classroom testing tasks will be presented to show 
you how these practical problems fit into the general framework to
be presented in this chapter. For you, the practitioner, chapters 4 
and 5 should provide the map and the fuel; while chapter 6 provides 
help in actually reaching a particular destination. 



Comprehensiveness and Acceptability of the Objectives 

Evaluation is not purposeless. A teacher who undertakes the design 
of an evaluation device does so because there must be a need for 
such an evaluation. The general goal is to fulfill the need — to solve 
some problem. The specific goals and objectives may be translated 
to a long list in performance terms. They may be a wispy little
sketch somewhere in the teacher's consciousness. 

As you (the teacher) conceptualize a particular problem in your 
classroom, two important issues must be considered at the earliest 
stages of deliberations. Your resolution of these two issues will be 
instrumental in determining the way your evaluation will eventu- 
ally be written, used, reported, and interpreted. The first issue is 
raised by this question: How detailed and comprehensive is the list 
of objectives for the measure? The second question was already 
introduced in chapters 2 and 3: How acceptable is this list of objectives
to the majority of teachers using the measure? That is, do the
majority of teachers view the list of objectives as an acceptable 
representation of the major goals and objectives of the instructional 
program? 

The first decision point reflects the degree of specificity and ac- 
ceptability of objectives. Four distinct categories are suggested. 
Whenever you are faced with the task of preparing an evaluation 
device for your classroom, first determine which of these categories 
seems to fit your particular situation best. 

1. All of the major objectives to be covered by the evaluation 
are specified in performance terms and are measurable. In addition,
most people in the field (the teachers, administrators, and
others involved in the instructional program) agree that this list of 
objectives does reflect the program's general goals and objectives. 

2. A serious attempt has been made to specify the goals and 
objectives of the program in performance terms. But while most 
members of the instructional team feel that the list is not exhaus- 
tive, the majority of those involved have generally agreed that a 
student who can perform at criterion level on the tasks specified is 
likely to have also reached the general goals and objectives of the 
instructional program. Agreement exists that the list is not exhaustive,
yet it is still a fair representation of the performance objectives
implied by the program's general goals and objectives. 

3. A serious attempt has been made to translate the program's 
goals and objectives to performance terms and, once again, most of 
the people involved in the instructional program agree that the 
statement of performance objectives is not exhaustive. Both of 
these statements are the same as in (2) above. Now, however, 
serious disagreement exists among the people involved with the 
instructional program regarding whether or not the objectives as 
stated really catch the flavor of the overall goals and objectives of 
the program. Even if a student could perform at criterion level on 
all the objectives stated, some people would still not be willing to 
certify that this student had completed the more general goals of 
the instructional program. 

4. No attempt — at least no serious attempt — is made to specify 
the objectives of the program in performance terms. Instead, the 
overall coverage of the evaluation device is defined. For example, 
the authors of the Graduate Record Exam or the Iowa Test of 
Educational Development specify the areas covered only in general 
terms. They do not attempt to list all of the performance objectives 
sampled by the measure. A classroom teacher who must write a 
comprehensive social studies test dealing with a 350-page book, 
seven films, and fourteen pamphlets will be hard-pressed to specify 
all of the objectives in performance terms. This teacher will undoubtedly
do what the test publishers do: define objectives in
terms of overall coverage only. 

In review, the categories range from (a) all objectives stated in
performance terms, with general agreement on them; to (b) general
agreement that the objectives as stated are not exhaustive but do
catch the full flavor of the general goals; to (c) a nonexhaustive
list of objectives, with considerable disagreement as to whether the
list reflects the overall goals; to (d) the case where no serious
attempt has been made toward translating the general goals of the
evaluation into performance language.
Given the variety of evaluation tasks faced by the classroom 
teacher, it is highly probable that over a period of time a situation 
will present itself for which each category is appropriate. An explo- 
ration of the relationship among the four, along with examples of 
each, seems necessary if this section is to function for you as a 
practical and useful device to improve classroom evaluation. 
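
Before turning to examples, the four categories can be condensed into a short decision rule. The sketch below is in Python; the function name and its three yes/no arguments are hypothetical labels for the questions raised above, not an established instrument. One case is not described in the text (an exhaustive list that people still dispute), and the sketch simply assumes it belongs in category 3.

```python
def evaluation_category(serious_attempt, exhaustive, generally_accepted):
    """Return 1-4 for the four categories described in this section.

    serious_attempt    -- were the goals seriously translated into
                          performance terms?
    exhaustive         -- does the list cover all major objectives?
    generally_accepted -- do most people involved agree that the list
                          reflects the program's general goals?
    """
    if not serious_attempt:
        return 4  # defined by overall coverage only
    if not generally_accepted:
        # Assumption: a disputed list is category 3 even if exhaustive.
        return 3
    return 1 if exhaustive else 2


print(evaluation_category(True, True, True))     # complete and accepted
print(evaluation_category(True, False, True))    # "catches the spirit"
print(evaluation_category(True, False, False))   # disputed list
print(evaluation_category(False, False, False))  # coverage only
```

The order of the questions matters in this sketch: whether a serious attempt was made is asked first, acceptability second, and exhaustiveness only separates categories 1 and 2.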

An argument might be made that no realistic classroom evaluation
situation can be included under the first category (or, to use
the terminology of the "new math," this category is an empty set). 
Can you envision a realistic classroom situation where all of the 
goals could be translated into performance terms? The requirement 
is quite restrictive, and any real application will by necessity in- 
volve a fairly clear and concise training program. General goals 
(good citizenship, reading for meaning, good facility with numbers) 
will obviously not apply, since it is virtually impossible to get
general agreement on the precise meaning of each. Likewise, any 
program where the general goals are very future-oriented must be 
eliminated from this category. You cannot realistically expect to 
specify and measure all of its goals. For example, if one goal of a 
math program is that the student choose interest rates wisely when 
securing a loan, it is a future-oriented one. The goal cannot be 
measured until after the student leaves school. Although the goal 
can be stated in performance terms, the objectives are not easily 
measured if future performance is a key aspect of the program. 

Of course, in the general sense, all schooling can be construed to 
have a future orientation. The general goal of the school is some- 
thing like this: To prepare people to function as happy, productive, 
and satisfied members of our society — certainly a goal with a future 
orientation. Some goals are more future-oriented than others. A 
subtle difference exists. The culmination of the math goal (finding 
the best interest rate for a loan) will not occur until sometime in 
the future. This is clearly future-oriented. On the other hand, the 
goal of using proper grammar is also pursued with an implied future 
orientation; but the act of actually using good grammar can happen
immediately following instruction. The teacher can observe the 
grammar in a testing situation as well as in any written material 
prepared by the student. Thus, it appears that even though nearly 
all of what happens in the school is, in general, future-oriented, this 
does not negate the possibility that some general objectives might 
be translated completely to performance terms and be immediately 
measurable.

How could a general area like grammar fit into category 1? 
Grammar might be loosely defined as a normative system of rules 
in language which are used for the acceptable construction of sen- 
tences. To be sure, grammar books can be thick, but the entire set 
could be phrased in performance language with a reasonable criterion
level established for each. The general goal would be that
the student use grammar properly (that is, at the criterion level or 
above) at the present time. The entire general goal could be 
translated into performance language. 



Other examples include training programs in accounting or 
typing, or a program designed to promote safe use of chemicals 
in the laboratory. The category wherein the general goals of a pro- 
gram are completely specified in performance terms is not an empty 
category. Realistic examples do exist. 

The goal here is that you will be able to place each evaluation 
problem facing you in one and only one of the four categories. If 
this is to be realized, then the distinctions among the four must 
be very clear. How does the first category (complete specification of 
all goals in performance language) differ from the second (an 
attempt at specification in performance language which falls short,
but most members of the instructional team feel that the objectives
as specified are probably satisfactory indicators of the desired
performance)? 

The second category will include those cases where the measure- 
ment of a performance objective cannot occur until sometime in 
the future. As was pointed out, the performance objectives for 
grammar are for the future, but can be checked immediately. When 
the objectives cannot legitimately be checked until some time in the 
future, category 2 is required. Most mathematical training will 
come under this general heading. A student first learns the number 
facts like 4 + 6 = 10 because these facts are necessary for a 
functioning citizen, but also because they are requisite skills to 
adding 46 + 24 and other important mathematical learning. The 
general goal includes learning the fact for its own sake, but another 
aspect of this general goal is linked to aiding future learning of more 
complex skills. You can measure immediately to see if the student 
remembers the fact (4 + 6 = 10). However, a measure of whether 
or not the skill can be properly applied at a later date cannot be
obtained until the need appears. 

A second specific kind of example fits into this category of future 
measurement and flexibility. It is the "that's not exactly what I 
mean, but it's close enough" kind of translations. To illustrate, 
consider a general objective of a national law-focused social studies 
curriculum program:¹

Probably the most important reason for teaching Supreme Court
cases at this time in our national history is that it can sustain the
faith of young Americans in our Judicial System. It can demonstrate
how that system should function so that students will be
aware of its malfunctions. It can motivate students to seek more
justice from our courts and other instruments of government.²

1. The Law in American Society Foundation, based in Chicago, Dr. Robert
H. Ratcliffe, Executive Director.

What are the key parts of this general statement to which per- 
formance objectives must be attached? The paragraph is repeated 
below with these key elements italicized. 

Probably the most important reason for (1) teaching Supreme 
Court cases at this time in our national history is that it can 
(2) sustain the faith of young Americans in our Judicial System. 
It can (3) demonstrate how the system should function so that 
(4) students will be aware of its malfunctions. It can (5) motivate 
students to seek more justice from our courts and (6) other 
instruments of government. 

This book is not the place to develop the entire series of per- 
formance objectives which should be tied to the paragraph. The list 
will be lengthy, and should be assembled only after extensive 
deliberations by those charged with leading the program. However, 
suppose a list something like the following was assembled: 

1. The content for the program is specified. The staff could 
assemble lists of specific cases; designate the information which 
should be retained; and prescribe the relationships which are con- 
sidered important for the student to understand. What information 
from the Dred Scott case is important to recall or recognize? How 
did the decision affect the happenings of that era? How does it 
relate to what is happening today? Such relationships could be 
specified in performance terms. It is likely that a consensus could 
be reached by the project staff that the list "catches the spirit" of 
the general objective. 

2. "Sustain the faith of young Americans in our Judicial System."
That's a mean one to translate. Examples of "sustained
faith" could be assembled. A short movie depicting a hypothetical
case of a citizen being treated unjustly might be presented to the
student. He would then be asked to indicate how the case should
be decided based on the law (an information question). He might
also be asked to predict how he thinks the case will come out
(which would indicate how much faith he has in the judicial system).
A list of such examples, once again tied to the materials in
the program, could be assembled. The staff probably could be
brought to the general consensus that while the list may not
exhaust the general objective completely, it does "catch the spirit"
of measuring "sustained faith" of young Americans in our judicial
system.

2. From the Teachers Guide to GREAT CASES OF THE SUPREME
COURT (Boston: Houghton Mifflin, 1971), p. 3.

You might feel at this point that the real "spirit" of seeing young 
Americans manifesting "sustained faith" can occur only as the 
young people reach maturity and take their places as adult citizens. 
Will they, at that time, attempt to have their grievances amelio- 
rated through the system or will they go outside the prescribed 
channels? Will their actions at that time indicate faith in the 
judicial system? If this is the way you feel, the objective moves 
from category 2 to category 3. Category 3, you recall, includes those 
cases where a serious attempt has been made to translate general 
objectives to performance terms, but where some feel the list has 
not caught the real flavor of the general goal. Some are not willing 
to say that the list "catches the spirit" of the objective. They 
would be arguing that something serious is missing. 

The distinction between categories 2 and 3, then, is an individual 
one. One person (or perhaps even the majority of people in charge
of the project or instruction) may feel that the list of performance
objectives does capture the spirit of the general statement and 
place it in category 2. They would feel that any student reaching 
criterion level on the performance objectives has satisfactorily 
mastered the general goal. If you disagree in the manner described 
above, you would place the objective in category 3. The decision is 
an important one. As you will see presently, the placement will 
affect the manner in which results will be reported, as well as the 
criterion you use to attach a value statement to the student's 
performance. For example, if you are quite sure that the particular 
performance objectives do capture the spirit of the general objec- 
tive, you would feel justified in criticizing a student who has not 
reached criterion level. But if you seriously question the value of 
the performance objectives, you would be less willing to penalize a 
student simply because he does not agree with you and manifests 
a different kind of performance. Recall the two cases cited in chap- 
ter 2 — the case of the performance objectives in speech contrasted 
with the performance objectives in arithmetic. 

The one-paragraph goal of the law-focused social studies pro- 
gram is one of the general goals of this program, and the translation 
of two parts of this general goal into performance terms is just a 
beginning. The entire task, including serious discussions about
what the general terms actually mean, would require a substantial 
commitment of time. Earlier, the author reported his experiences
in writing objectives for an elementary school science program, 
making the point that the task is a very difficult and time-consum- 
ing one. These two examples lead to the following general state- 
ment: 

If you, charged with the task of directing the learning of some 
students, do not have available a prepared list of performance 
objectives, or are not given substantial periods of preparation time 
to construct such a list, then your evaluation program is probably 
going to be assigned to category 4. It will be a case where no serious 
attempt has been made to translate the general objectives to per- 
formance terms. Instead, the evaluation will be defined in terms of 
overall coverage. 

No teacher need hang his head over that statement. The people 
who should feel chastised are the ones who failed to compile or 
secure such a list for the teacher. Anyone who has taught, whether as
a regular job, as a substitute, or as a practice teacher, is aware 
of the demanding nature of the task. Find a good teacher and you 
probably have found someone working far in excess of a forty-hour 
work week. If you are the teacher, and unless you are incredibly 
efficient or have extremely high motivation, you probably will not 
have the time to translate all of the general objectives in your 
instructional program to performance terms. 

Where, then, do the lists of performance objectives come from? 
A number of sources already exist, and others could be nurtured by 
pressure from within the educational community. Consider the 
following: 

1. Some school districts have assigned teacher groups the task 
of translating the major general goals into lists of performance 
objectives. Often this task is undertaken during the summer 
months. In any event, almost all districts have some sort of cur- 
riculum guide. Check the degree of specificity of the objectives in 
this document. If the objectives in your district have not been 
translated into performance terms (to the degree that this is possi- 
ble) you might begin agitating for the implementation of a writing 
program during the next summer period. 

2. As a starting point for such a writing program, the officials 
in your district might seek out one of the objectives depositories 
which are forming in this country. At an objectives depository a 
whole bank of performance objectives has been collected and stored
under specific headings. You might feel that using something so 
"ready-made" smacks of a controlled and somewhat sterile society. 
View this pool of prewritten objectives as only a starting place for 
your district. Select from the list those objectives which fit the 
philosophy of the home group. The task of selecting objectives 
from a list written by others (filling in some original ones as needed)
is far more efficient and effective than having each district "start
from scratch."³

3. Some textbook publishers are now providing performance
objectives in textbook supplements. While this movement is not yet 
widespread, school administrators could accelerate the pace by 
demanding such supplements before purchasing any new books. 
You, as a teacher in a district, could let your wishes on this matter 
be known to the administration. 

If you are faced with a classroom evaluation problem where the 
general goals have not been translated to performance terms you 
need not necessarily relegate the problem to category 4. The inten- 
tion of the preceding discussion was not to discourage you from a 
personal attempt at the translation. Any serious attempt that you 
make in this direction, even if it turns out to be just a beginning, 
will aid both the instructional and evaluation tasks; for generally, 
as one attempts to sharpen the evaluation focus, the instructional 
focus improves as well. Before you start the task, however, look for 
help — an objectives depository, your district's curriculum guide, or 
the teachers' supplement to your textbook. 



Questions 

1. Think of two additional examples of testing situations which you 
believe can be appropriately assigned to each of the four categories 
listed under "Comprehensiveness and Acceptability of the Objectives."

2. Which of these would more likely be assigned to the first category: 

a. A test to measure the acquisition of phonic skills for fourth 
graders. 

b. A test to measure the ability of fourth graders to "make change." 
Explain your choice. 

3. Think of a particular testing situation. Explain why, in this situation,
some teachers would place the test in category 2, and others would
see it as belonging in category 4.

3. These depositories are likely to exist in places like consortiums of school
districts, schools of education, or large city school districts. Two well-known
depositories are the Instructional Objectives Exchange under the direction of
Professor W. James Popham of UCLA, and the Laboratory of Educational
Research at the University of Colorado, where a pool of measures in the
affective domain is in operation.



The Comprehensiveness of the Measure 

Remember that the goal of this chapter is to give you a general but 
practical framework wherein you can place each specific evaluation 
problem which faces you in the classroom. In the next chapter, the 
tools of the trade will be introduced, and in chapter 6 a series of 
examples presented. After you have conceptualized the evaluation 
problem facing you so that it fits satisfactorily into one of the four 
categories, the next task is to decide how comprehensive the mea- 
sure is to be. In considering the comprehensiveness question, three 
distinct possibilities exist: 

1. You could design some evaluation task for every single objec- 
tive. Much as a pilot checks every single gauge and indicator before 
taxiing onto the runway, you would be checking every single objec- 
tive before allowing the student to move on to another task. 

2. A sampling process could be introduced. Perhaps it is simply 
too time consuming for both you and your student to measure each 
objective. If you sample in a random manner from the total domain 
of objectives, a good indicator of the student's overall mastery will
be provided. 

3. A biased sampling process can be undertaken. Most classroom 
tests are probably prepared in this manner. In this process, you 
survey the total coverage to be evaluated and select the most rele- 
vant, and/or thought-provoking topics for inclusion. 

The first category, wherein every objective is evaluated in some
manner, seems fairly clear. Assuming the objectives are specified in 
performance terms, why should the second, with its random sam- 
pling, even be conceptualized? Why not just do an exhaustive 
testing of all objectives? To answer these questions, consider these 
objectives — which are important ones for an elementary school 
math program: 

Given arrays of money (items chosen from penny, nickel, dime,
quarter, half dollar, or dollar bill) the student shall write the total
amount of the array in the form $0.00 and 0000.




Given a time (involving one minute intervals and the concepts of 
A.M. and P.M.) the student shall compute an earlier or later time as 
directed by a printed problem. 

Think for a minute about the implications of a one-to-one relation- 
ship between objective and measure with the first objective. Here 
is the beginning of an exhaustive list of measurement items which 
should be included: 

Using just one coin: 6 p, 6 n, 6 d, 6 q, 6 hd, 6 dol (here a total 
of six coins in the array has been assumed). 

Using two coins: 5 p, 1 n; 4 p, 2 n; 3 p, 3 n; 2 p, 4 n; 1 p, 5 n
(repeat for combination p,d; p,q; p,hd; p,dol; n,d; n,q; n,hd; n,dol; 
d,q; d,hd; d,dol; q,hd; q,dol; hd,dol). 

Using three coins: 4 p, 1 n, 1 d; 1 p, 4 n, 1 d; 1 p, 1 n, 4 d; 3 p, 
2 n, 1 d; 3 p, 1 n, 2 d; 2 p, 3 n, 1 d; 2 p, 1 n, 3 d; 1 p, 2 n, 3 d; 
1 p, 3 n, 2 d; 2 p, 2 n, 2 d (repeat for all other combinations of 
three: p,n,q; p,n,hd; p,n,dol; p,d,q; p,d,hd; p,d,dol; p,q,hd; p,q,dol; 
p,hd,dol; n,d,q; n,d,hd; n,d,dol; n,q,hd; n,q,dol; n,hd,dol; d,q,hd; 
d,q,dol; d,hd,dol; q,hd,dol). 

Now the arrays for using four coins, five coins, and six coins will 
have to be displayed. To build a test based on random selection, 
you would sample every tenth, twelfth, fifteenth, or whatever — the 
frequency of the sampling depends on how long you want the test 
to be. 
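The enumeration and every-k-th sampling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`all_arrays`, `every_kth`, `total_in_dollars`) are hypothetical, an "array" is treated as an unordered collection of six items, and a sampling interval of fifteen is chosen arbitrarily from the intervals the text mentions.

```python
from itertools import combinations_with_replacement

# Denomination labels follow the abbreviations used in the text.
DENOMINATIONS = ["p", "n", "d", "q", "hd", "dol"]
VALUES_IN_CENTS = {"p": 1, "n": 5, "d": 10, "q": 25, "hd": 50, "dol": 100}


def all_arrays(size=6):
    """Return every distinct array of `size` items, ignoring order."""
    return list(combinations_with_replacement(DENOMINATIONS, size))


def every_kth(items, k, start=0):
    """Systematic sample: keep every k-th entry beginning at index `start`."""
    return items[start::k]


def total_in_dollars(array):
    """Answer key for one array, written in the $0.00 form."""
    cents = sum(VALUES_IN_CENTS[coin] for coin in array)
    return f"${cents // 100}.{cents % 100:02d}"


arrays = all_arrays()
test_items = every_kth(arrays, 15)
for array in test_items[:3]:
    print(array, total_in_dollars(array))
```

Under these assumptions the complete list holds 462 arrays, so sampling every fifteenth one yields a test of 31 items; a larger interval gives a shorter test.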

Arraying all possible combinations is a job. It would be far easier 
to take a die (one from a pair of dice) and paint "p," "n," "d," "q," 
"hd," and "dol" on the six sides. Then if you want a thirty-item 
test, you roll the die six times for the array to be included in each 
problem. Such a procedure would also be random. 
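The painted die can be imitated with a random number generator. The sketch below is a hypothetical stand-in for the physical die, not a prescribed procedure; `roll_array` and `build_test` are illustrative names, and the seed is supplied only so that the example is repeatable.

```python
import random

FACES = ["p", "n", "d", "q", "hd", "dol"]  # one label per side of the die


def roll_array(rng, rolls=6):
    """Roll the labeled die `rolls` times to build the array for one problem."""
    return [rng.choice(FACES) for _ in range(rolls)]


def build_test(n_items=30, seed=None):
    """Build `n_items` arrays; pass a seed only to make the draw repeatable."""
    rng = random.Random(seed)
    return [roll_array(rng) for _ in range(n_items)]


coin_test = build_test(n_items=30, seed=1973)
print(coin_test[0])
```

One design point worth noting: rolling the die makes mixed arrays far more likely than arrays of a single denomination, whereas sampling from a complete list gives every distinct array the same chance. Either approach is random; they simply weight the domain differently.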

The second objective will require questions such as this: 

It is 12:33 a.m. and Tom has been working 7 1/2 hours. What time
did he get to work? 
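When many such items are generated, an answer key is easier to trust if the clock arithmetic is done mechanically. The following is a minimal sketch, assuming clock readings are written in the "12:33 a.m." form used above; the helper names are hypothetical.

```python
def parse_clock(clock):
    """Turn a reading such as '12:33 a.m.' into minutes after midnight."""
    time_part, meridiem = clock.split()
    hour, minute = (int(piece) for piece in time_part.split(":"))
    hour %= 12                      # 12 a.m. is hour 0; 12 p.m. becomes 12 below
    if meridiem.lower().startswith("p"):
        hour += 12
    return hour * 60 + minute


def format_clock(minutes):
    """Turn minutes after midnight back into the '5:03 p.m.' form."""
    minutes %= 24 * 60              # wrap across midnight in either direction
    hour, minute = divmod(minutes, 60)
    meridiem = "a.m." if hour < 12 else "p.m."
    return f"{hour % 12 or 12}:{minute:02d} {meridiem}"


def shift_time(clock, hours, earlier=True):
    """Compute the time `hours` earlier (or later) than `clock`."""
    change = round(hours * 60)
    start = parse_clock(clock)
    return format_clock(start - change if earlier else start + change)


print(shift_time("12:33 a.m.", 7.5))  # Tom's arrival time: 5:03 p.m.
```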

Since an infinite number of applications of this objective can be 
devised, a complete testing is impossible. Again, though, a random 
sampling of possible items could be carried out. Suppose you want 
to focus on these concepts within the objective: 

1. Time changes within 12 midnight to noon (a.m. only) or noon 
to 12 midnight (p.m. only). 

2. Time changes forward (a.m. forward to p.m. or p.m. forward 
to a.m.) and backward (a.m. to p.m. and p.m. to a.m.).






3. Use of certain words ("____ hours earlier," "____ hours ago,"
"____ hours since he arrived," "____ hours after," plus others) and
use of time as whole hours, fractional hours, as well as hours and
minutes.

4. Amount of change ranging from 30 minutes to 30 hours in one 
minute intervals. 

A table of specifications similar to table 1 could be devised. To 
determine if the time is to be expressed as minutes or as a fraction, 
flip a coin. To determine the actual numbers to be used in the 
problems, choose them from a random number table (almost any 
statistics book will have one). 





                  within     within        FORWARD            BACKWARD
                  noon to    midnight   A.M.-    P.M.-     A.M.-    P.M.-
                  midnight   to noon    P.M.     A.M.      P.M.     A.M.

"hrs. earlier"       6         10        18        3        22        9
"hrs. ago"          11         17         2       21        23        7
"hrs. since"        15          1        20       25        24       29
"hrs. after"         5         14        13        4        26        8
                    16         12        19       27        28       30

(The numbers in the cells are the item numbers for the test)

Table 1.



If it turns out that the left hand column is very long because you 
are attempting to assess a long list of terms, you might want to do 
some item sampling. That is, you would not have an item in each 
and every cell. Instead, you might just choose (at random) every 
other cell or every third cell. 
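The table of specifications, the coin flip, and the item sampling just described can be combined in one short routine. This is a sketch under several assumptions that are not in the text: only the four labeled wordings from table 1 are used, fractional amounts are drawn in quarter-hour steps, and the name `build_specifications` and the dictionary keys are hypothetical.

```python
import random

ROWS = ["hrs. earlier", "hrs. ago", "hrs. since", "hrs. after"]
COLUMNS = [
    "within noon to midnight",
    "within midnight to noon",
    "forward a.m. to p.m.",
    "forward p.m. to a.m.",
    "backward a.m. to p.m.",
    "backward p.m. to a.m.",
]


def build_specifications(rows, columns, rng, keep_every=1):
    """Give shuffled item numbers to cells, keeping one cell in `keep_every`."""
    cells = [(row, column) for row in rows for column in columns]
    if keep_every > 1:
        start = rng.randrange(keep_every)      # random starting cell
        cells = cells[start::keep_every]
    numbers = list(range(1, len(cells) + 1))
    rng.shuffle(numbers)
    plan = []
    for (row, column), number in zip(cells, numbers):
        # The coin flip from the text: minutes or a fraction of an hour.
        expressed_as = rng.choice(["minutes", "fraction"])
        if expressed_as == "fraction":
            change = 15 * rng.randint(2, 120)  # quarter-hour steps (assumption)
        else:
            change = rng.randint(30, 30 * 60)  # 30 minutes to 30 hours
        plan.append({
            "item": number,
            "wording": row,
            "time_change": column,
            "expressed_as": expressed_as,
            "change_in_minutes": change,
        })
    return sorted(plan, key=lambda cell: cell["item"])


rng = random.Random(72)  # seeded only so the sketch is repeatable
full_plan = build_specifications(ROWS, COLUMNS, rng)
sampled_plan = build_specifications(ROWS, COLUMNS, rng, keep_every=2)
print(len(full_plan), len(sampled_plan))
```

Setting `keep_every=2` or `keep_every=3` corresponds to placing an item in every other cell or every third cell when the list of terms grows long.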

Why go to all that trouble? Why not just select items as they pop 
into your head? The answer is that such a selection process may not 
really cover the entire domain because, without meaning to be, your 
selection process will be biased. You may overdo the easy problems 
or the difficult ones; choose too many examples with three or four 
coins; overemphasize the half dollars or dimes; forget certain key 
concepts to be tested; or any of a variety of things. With a system- 
atic sampling process you will assure yourself of "even" coverage. 

Whenever the general goals have been completely translated to 
performance terms, it should be relatively easy to define the domain 
of possible questions and sample systematically from it. If the 






evaluation problem belongs in categories 2 or 3 (an attempt has 
been made, at least, to translate the general objectives to per- 
formance terms) the task of defining the domain of possible ques- 
tions and sampling systematically may be a little more difficult. 
When it is possible to sample in the manner illustrated above, you 
should do so. 

Sometimes the task is pretty overwhelming, virtually impossible.
What do you do then? Well, you can still do a modification of
a systematic sampling scheme. Suppose, for example, you need to 
plan some sort of evaluation process for two chapters covering 40 
pages in a book (20 pages in each chapter). In addition, you decide 
that about 10 percent of the questions should force the student to 
summarize all the material; 10 percent of the questions should be 
summary questions from the first chapter and 10 percent summary 
questions from the second; and the other 70 percent should be fairly
specific questions spread somewhat evenly over the 40 pages.⁴

To make the example come out even, assume a twenty-item test.
Figure 1 shows how you could systematically sample from the 40
pages.

[Figure 1. Pages 1 through 40 in a row, with items 1 through 14 spaced evenly beneath them; items 15 and 16 span the first chapter, items 17 and 18 span the second chapter, and items 19 and 20 span all 40 pages.]

Now you can't expect to conform specifically to the chart,
since it could happen that page 20 is nothing but a chart, and page 
33 has four pictures on it. But you would at least try to stick to the 
plan. The alternative is to leaf through the book and construct
items as thoughts "hit you," which allows for all of the biases in
selection mentioned earlier. Obviously, the technique is only the



4. We hope you'll avoid conceptualizing this as a true-false, multiple 
choice, essay, or some other specific kind of test. In the next chapter, we hope 
to make the point that there are a variety of manners for evaluating an 
objective. These include paper-and-pencil tests, to be sure, but also include 
performance tests, oral measures, unobtrusive observations, and other tech- 
niques. We're trying to avoid talking about specific kinds of items until the 
next chapter — which is why we constantly use "evaluation device" or 
"measure" in place of the word "test." 




beginning of a random process. It merely gets you focused on the 
different information and concepts which happen to be on the vari- 
ous pages. You still must select the particular questions to ask, and 
these can range all the way from a straight recall-based question to 
a question asking for a complicated analysis of the concept. 
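The twenty-item plan for the two 20-page chapters can be laid out mechanically before any questions are written. The sketch below is one possible way to do it, assuming the specific items are numbered first and the summary items last; `page_blocks` and `build_plan` are hypothetical names.

```python
def page_blocks(first_page, last_page, n_blocks):
    """Split a page range into n_blocks consecutive, nearly equal blocks."""
    total = last_page - first_page + 1
    return [
        (first_page + (i * total) // n_blocks,
         first_page + ((i + 1) * total) // n_blocks - 1)
        for i in range(n_blocks)
    ]


def build_plan(chapters, n_items=20, n_overall=2, n_per_chapter=2):
    """Number the items: specific ones first, then chapter and overall summaries."""
    first_page, last_page = chapters[0][0], chapters[-1][1]
    n_specific = n_items - n_overall - n_per_chapter * len(chapters)
    kinds = [("specific", block)
             for block in page_blocks(first_page, last_page, n_specific)]
    for chapter in chapters:
        kinds += [("chapter summary", chapter)] * n_per_chapter
    kinds += [("overall summary", (first_page, last_page))] * n_overall
    return [(number, kind, pages)
            for number, (kind, pages) in enumerate(kinds, start=1)]


plan = build_plan(chapters=[(1, 20), (21, 40)])
for number, kind, pages in plan:
    print(f"Item {number:2d}: {kind:15s} pages {pages[0]}-{pages[1]}")
```

The plan only says where to look for each item. As the text notes, a page that is nothing but a chart will force some adjustment, and the kind of question asked about each block is still your decision.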



Questions 

4. Just to illustrate the technique, construct an actual 30-item test built
around the coin problem presented in this section. Explain each of 
your steps so that another student will understand the process. 

5. Go to the library and find a junior high school social studies text- 
book. Select the fourth chapter in the book for a hypothetical test in 
which three items will be over the whole chapter, and seven items 
tied fairly specifically to smaller numbers of pages. Do these steps: 

a. Divide the number of pages in the chapter by 7 (for the seven
specific items). Divide the chapter into seven segments according
to this outcome. 

b. Make seven brief topical/content/concept summaries for the seven
parts of the chapter. 

c. Choose from each the most important information, content, con- 
cept, or an application thereof. 

d. Write a question about the topic you chose for each segment. 

e. Write three comprehensive questions over the chapter. 



Table 2 shows the relationship between the test categories and 
the comprehensiveness of the measures. The table is fairly self- 
explanatory. The category into which you place your particular 
evaluation problem is important, for the later decision about com- 
prehensiveness is partially dependent on the earlier categorization. 
The first two categories strongly suggest an evaluation scheme in- 
volving either complete or random measurement of objectives. A 
biased testing scheme would defeat the purpose of completely 
specifying objectives in performance terms. In the case where 
serious disagreement exists over the coverage of the objectives, 
either a random or biased sample applies — the decision depends on 
how you view the matter. If you are one of those who disagrees 
with the list as specified, you will undoubtedly opt for a biased 
sample. Include in the sample those objectives which are in harmony
with your own personal philosophy, the ones you think are
important. In disagreement cases where you cannot settle on your
own personal viewpoint, a random sample seems more appropriate.

[Table 2. Relationship between the test categories and the comprehensiveness of the measure.]

Finally, if the category of your test is the fourth one, then obvi- 
ously some type of sampling will be required. A true random 
sampling technique is impossible, for such an approach requires the 
complete specification of the domain of possible questions — and 
this is precisely what you do not have in category 4. But a 
systematic selection procedure could be followed, ensuring that 
your measure does survey the entire area of content in a fairly even 
manner. Clearly such a technique is not totally random, for your 
own biases enter into the topics chosen and the manner of stating 
questions. 

The fact that the sampling of evaluation items involves personal 
bias is not to be construed necessarily as bad or reprehensible. To 
be sure, sensitive teachers have followed precisely this procedure in 
a very successful manner. Not all who teach are sensitive, however. 
All people have some biases — you have some too. When your 
selection is not random you cannot avoid including these personal 
feelings. While it is possible that these biased selections of topics 
indicate your sensitivity, it is also possible that your selection may 
be unreasonable or unfair. A biased selection of items is not inher- 
ently wrong. The danger comes when you close your mind to the 
fact that you control the selection of topics and questions. Thus, 
the test results are also under your control. And there does exist a 
distinct possibility that your current biases might, just might, be
incorrect.



Test Administration Decisions: 
Who is to Be Evaluated? When? 

Assume now that you've decided which category is most appropri- 
ate for your measure and how comprehensive the measure is to be. 
The next decision is a minor administrative one dealing with the
evaluation sequence. Presumably the measure was constructed to 
be used with some students. How shall you set up the administra- 
tion of the test? 

Obviously you set aside some time during a class period, an- 
nounce this time in advance to the group, and administer the 
measure (call it a test if you must) to all at the same time. Right? 



Table 3
Test Administrative Decisions

                                     Time of Testing

Content of               All evaluated at      Time varies; individual
the Measure              the same time         student is evaluated
                                               when ready

Same items for all              A                        B
of the students

Items vary among                C                        D
individual students



Not necessarily. Consider these four possibilities in table 3. Which 
cell will you choose for your measure? The decision you make is a 
very important one, since it will affect the manner in which you 
construct the measure and report the results. The expectations 
you have for the students and, in a certain way, your general edu- 
cational philosophy, are reflected in your choice. Consider the four 
categories a little more deeply: 

A. Same items administered to all of the students at the same
time. This is probably still the most common testing sequence in the
schools. You have had plenty of experience with it. Assuming that 
the instructional technique is constant for all members (same mate- 
rials, same amount of time on task, same technique), it probably 
follows that the scores on the measure will be variable. A proportion 
of students score high, others low, and the rest somewhere in 
between. Some say variations in performance are due primarily to 
time on task.⁵ That is, aptitude is not defined as something innate,
but rather as the amount of time it takes a student to master a 
concept. Others might argue, albeit implicitly, that the variation in 
performance is due to innate aptitude or variations in environ- 
mental pressures. This book is not the place for a lengthy treatise 
on these topics, but you must keep this point in mind: Testing in 
this manner will lead to variations in performance. Is that what you 
want? Are you prepared for the outcome that some students will 
have mastered the material much better than others? Are you 



5. Benjamin S. Bloom, J. Thomas Hastings, George F. Madaus, Handbook 
of Formative and Summative Evaluation of Student Learning (New York: 
McGraw Hill, 1971), chapter 3. 




prepared to explain the individual differences to the students and 
their parents? Will you keep instructing the lower performers until 
they reach higher levels on the evaluation, or will you go on to the 
next topic, realizing that variations exist in the level of learning for
the current concepts? 

It was good enough for my Grampa and it was good enough for 
my Dad so why isn't it good enough for my students? Right? Look 
at the next category: 

B. The same measurement items are used for all the students, 
but the time of administration varies. Presumably this means that 
the student is allowed time to master a concept before the measure 
is administered. What are the implications of the use of this ad- 
ministration sequence — implications for the construction of the 
measure, the expectations you have, and the manner in which you 
report? Consider these: 

1. The bookkeeping and storage problems can become enormous, 
assuming you follow this procedure in all areas of the curriculum. 
On Tuesday, Charlie is ready for Test I, Pete for Test II, and Sam 
for Social Studies measure X. Wednesday, Mary and Helen are 
ready for Test II, and Pete gets around to Social Studies measure 
X. Every day you're administering the same measures to different 
people. Some school systems are moving into computer-managed 
instruction, where the record keeping, prescribing, and test scoring 
are handled by the Machine, but it cannot be assumed that your 
school has — or soon will have — such a system. Administering the 
test as the individual student becomes ready for it is certainly 
sensitive to individual difference, and that is a commendable objec- 
tive, but it makes your life a little more difficult. 

2. Then there is the small problem of test security. If Pete and 
Charlie are friends, and Pete sees the test on Tuesday but Charlie
won't be ready until Thursday — don't you think some communica- 
tion will take place? Performance on the measure can begin to be a 
function of social ability: the more friends you have who learn
more quickly and therefore see the test before you, the better you 
will do. Under certain conditions, however, "learning for the test" 
is not a reprehensible sort of thing. It depends on how comprehen- 
sive the measure really is. 

Recall the three categories of test comprehensiveness: You can 
have a part of your measure tied to each of the objectives; you can 
randomly sample from the objectives; or you can select items for 
the measure which reflect your biases. If you have a measure which
reflects all of the objectives, then "learning for the test" (or "teaching
for the test") is a very good thing. The student who "learns
for the test" is busily engaged in fulfilling all of the objectives.

On the other hand, if the measure reflects some sort of sampling
(random or biased) from the total domain of content to be covered,
then test security is important. Figure 2 shows what you probably
want.

[Figure 2. Labels in the diagram: "This space includes all of the performance objectives that you decide the student should master"; "This is a random or biased sample from the domain"; "Reasoning: if the student does well on the sample, he had to reach approximately that performance level on all objectives."]

However, if the student knows in advance what the items are
to be, the reasoning part must change. Now, the only objectives you 
can assume the student has mastered are the objectives which he 
knew were going to be in the sample. You cannot assume he has 
paid the slightest bit of attention to the remainder — those objec- 
tives which he knew would not be a part of the sample. Whether 
or not the situation is serious depends on a variety of things. Some- 
times it is impossible to learn the sample without learning the 
entire domain. For example, a student who can master ten specific 
multiplication problems can probably master ten others. Also, if
the students have developed a philosophy focused on reaching 
objectives rather than mastering a test, then test security is not a 
problem. The test security question does not necessarily make this 
testing sequence impossible, but it does deserve some thought. 

3. Using the same measure with all students but at different 
times says something about both your expectations for the students 
and the manner in which you will report the results. If you are 
waiting for the students to master some concept, you probably are 
not interested in score distributions. That is, you have established 
some sort of desired performance level, and you expect the student 
to reach this performance level before moving on. You expect each 
student to reach mastery. You do not expect a distribution in per- 
formance levels — only in time on task. The manner in which you 
report results should also reflect this philosophy. Stanines, grades, 
standard scores, and percentiles are unsatisfactory if the testing is 
mastery-based. You would report using some sort of a checklist 
showing which objectives had been mastered, and which objectives 
had not yet been completed. 




C. The sequence wherein all students are evaluated at the same 
time, but with items which vary as a function of individual differ- 
ences, is a less common one in the classroom. In category B the 
time on task was allowed to vary but the test items were constant. 
In this category, the time on task as well as the elements of the 
instructional program are constant for all students. Given that 
individual students will, for one reason or another, progress at 
different rates, it follows that the degree to which the objectives are 
mastered will range from complete mastery by some to considerably 
less mastery by others. Rather than administer the same test to 
all with the sure outcome that a range of scores would result, the 
test items could be allowed to vary, reflecting your predictions 
based on previous results about how much each individual has 
mastered. Such a procedure is called "in-level" testing. Although it 
is difficult with classroom measures, it is to be recommended with 
standardized tests. Although the major topic of this chapter is tests 
constructed by the teacher for the classroom, this seems like an 
appropriate place to digress for a moment on the topic of "in-level" 
testing with standardized measures. 



In-level Testing 

"In-level" testing simply means you administer to each individual a 
test which is most appropriate to his current performance level. If 
you are teaching third grade, for example, the students in your 
room will manifest a variety of reading levels. If a child is reading 
at approximately the first-grade level, you would not administer a 
third-grade test to him. Instead he would be assigned to complete 
the first-grade form. For a student reading at third- or sixth-grade 
level, you administer the form most fitting to his performance level. 
The performance levels are determined by your knowledge of the 
student — either informally through classwork, or more formally 
through a previous achievement test. 

Why "in-level" testing? Simply because the scores from almost 
any measure are most reliable at mid-range and least reliable at the 
extremes. This entire concept will be discussed later in chapter 8, 
but consider a child who is in third grade but reading at a fifth- or 
sixth-grade level. This child takes a third-grade test and virtually 
"aces" it — scores almost all of the items correct. That's fine; but 
the child really didn't have a chance to show you how good he 
really could be if challenged. In addition, when standardized tests 




are normed and revised, only a small proportion of the norm group 
actually scores at either the high or low extreme. Most of the norm 
group scores near the middle. Thus, the scores at the extremes are 
far less stable — far less reliable — than are the scores near the 
middle. In-level testing implies that you select a test with items 
neither too easy nor too hard so that each student responds to 
items which are most appropriate to him. 

In-level testing is probably a sequence you would not often select 
for teacher-made, classroom tests. But this concept is something 
which should be kept in mind in the administration of standardized 
measures. 

D. Students are measured at varying times, and the items in the 
measurement device vary also. Two major purposes could be served 
by scheduling your testing in this manner. First, the problem of test 
security can be overcome in cases where the items in the measure- 
ment device represent a sample from some larger domain of content. 
Second, you can adjust for individual differences, trying to make 
sure that all students will experience a modicum of success, regard- 
less of past performance levels. Consider the purpose of test secur- 
ity first. 

The purpose of changing the items is to insure that the student 
still views the measure as a random sample — a random and un- 
known sample. You probably won't have to change all of the items. 
If it becomes known that as little as one-third of the items always 
change between administrations, "learning for the test" will lead 
only to unhappy consequences. 

One problem with changing the items slightly is that it is often 
difficult to replace each item by another item of equal difficulty. 
Test publishers can do a pretty good job of this when constructing 
alternate forms, but they have data available on difficulty levels of 
alternate items. When building a classroom test, you probably 
won't have this kind of technical data available and will have to go 
on your own "feel" for the items. Suppose you ask for the reasoning
of the Supreme Court in United States v. Wong Kim Ark on one
form, and replace it with Trop v. Dulles on another. Both cases
involve a Supreme Court decision on a citizenship question. Math
or science problems are easy to change, requiring only that you 
alter a few numbers. Altering passages involving spelling, vocabu- 
lary, grammar, or paragraph interpretation likewise would not be a 
serious problem. 
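The idea of changing roughly one-third of the items between administrations can be sketched as follows. This is an illustration only: the function name `alternate_form`, the mapping of each item to a replacement, and the judgment that a replacement is of similar difficulty are all assumptions supplied by the teacher, not something the code can verify.

```python
import random


def alternate_form(base_form, alternates, fraction=1 / 3, seed=None):
    """Return a copy of `base_form` with about `fraction` of its items swapped.

    `alternates` maps an item to a replacement the teacher judges to be of
    similar difficulty; only items with a listed alternate can be swapped.
    """
    rng = random.Random(seed)
    swappable = [i for i, item in enumerate(base_form) if item in alternates]
    n_swaps = min(len(swappable), round(len(base_form) * fraction))
    chosen = set(rng.sample(swappable, n_swaps))
    return [alternates[item] if i in chosen else item
            for i, item in enumerate(base_form)]


# Hypothetical twelve-item measure with one alternate written for each item.
base = [f"item {n}" for n in range(1, 13)]
alts = {item: item + " (alt)" for item in base}
form_b = alternate_form(base, alts, seed=5)
print(sum(a != b for a, b in zip(base, form_b)), "of", len(base), "items changed")
```

Because the swapped positions are drawn at random, a student who has heard about an earlier form cannot tell in advance which items will still be there.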

Remember that you won't have to change the items every time 
you administer a test to an individual. After all, the notion of 




"individualized instruction" never really was to be interpreted as 
"solo instruction" — each student working completely alone. You 
undoubtedly will have groups of students who will naturally 
progress through a program at about the same rate. If five or six 
students are measured on the same day, you can use the same 
measure for them. If another student is measured some time later, 
you could reuse the same measure with him — assuming you have 
built a tradition wherein the students have learned to expect that 
the measures will be changed. 

So if you are testing students at different times in order to allow 
them to move through a learning program at their own rates, you 
can solve the problem of test security by altering a fraction of the 
items for forms administered on different days. Return now to the 
idea of reflecting differing performance levels by changing the items 
in the test. 

If you are an advocate of the interpretation that aptitude is 
really defined by "time on task," then it follows that you believe 
all the normal students can reach the performance objectives if 
they are given enough time. That seems like a satisfactory philos- 
ophy, but it raises an obvious practical difficulty. You do not have 
an unlimited amount of time with each student. Eventually de- 
cisions must be made moving the student from a task before he has 
mastered it simply because another task must also receive atten- 
tion. Eventually some of your students will be accomplishing more 
of the performance objectives than others. Given the range of 
individual differences one usually finds in any group of students, it 
seems difficult to imagine a classroom where this is not so. 

Why not administer the same measure, or somewhat altered 
forms of the same measure, to each student regardless of the num- 
ber of performance objectives you believe have been mastered? 
This will lead to a range of scores. You can report that John com- 
pleted 40 percent of the objectives, Harry finished 55 percent, Mary 
completed 82 percent, and Suzie finished 96 percent. John and 
Harry aren't going to feel good about the situation, although maybe 
that's all right. They're probably used to it. On the other hand, 
why keep kicking them in the teeth? You have a pretty good idea 
how many of the objectives they have mastered and the ones with 
which they will have trouble. Why not administer items to them 
which reflect objectives that you know they have attempted? Why 
not give them a chance to experience a feeling of accomplishment
and success? If it is not absolutely necessary for you to make 




comparisons among the students, where some students are sure to 
come out wanting, why do it? 

Suppose you take the example of a unit for which you have 
specified some fifty objectives in performance terms. The objec- 
tives range along the continuum from knowledge-based objectives 
through tasks which require synthesis and analysis. The students 
work at their own rates, staying on a task until they complete it — 
after which you administer some test to document the completion 
of the task. After two weeks, you feel it is necessary for all the 
students to move on to a new unit. Some have already mastered 
the fifty objectives and are working on supplementary things. 
Others will just complete the fifty objectives, while some will not 
have completed the entire list. To test, you could simply assemble 
measures which reflect the objectives which each student has had 
a chance to attempt. Of course, it may be difficult for you to make 
absolutely precise decisions about each student, but you probably 
have a pretty good idea of how much each person can accomplish. 
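Assembling a measure from only the objectives each student has attempted, and reporting the outcome as a checklist rather than a score, might look like the sketch below. Everything in it is hypothetical: the item bank with one task per objective, the function names, and the particular objectives credited to each student (the counts simply echo the 40 percent and 96 percent figures used earlier for John and Suzie).

```python
def individual_measure(item_bank, attempted):
    """Pick the tasks tied to objectives this student has attempted."""
    return [item_bank[objective] for objective in sorted(attempted)
            if objective in item_bank]


def checklist(objectives, mastered):
    """Mastery-style report: one line per objective instead of a score."""
    return [(objective,
             "mastered" if objective in mastered else "not yet completed")
            for objective in objectives]


# Hypothetical unit of fifty objectives, one evaluation task per objective.
item_bank = {n: f"task for objective {n}" for n in range(1, 51)}
attempted_by = {"John": range(1, 21), "Suzie": range(1, 49)}

measures = {name: individual_measure(item_bank, objectives)
            for name, objectives in attempted_by.items()}
print({name: len(items) for name, items in measures.items()})

john_report = checklist(range(1, 6), mastered={1, 2, 4})
for objective, status in john_report:
    print(objective, status)
```

The decision about which objectives a student has attempted remains a teacher's judgment; the routine only keeps the bookkeeping consistent once that judgment is made.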



Questions 

6. Consider the hypothetical case where you are sure many of your 
students will not have time to completely master all of the objectives 
of a particular course of instruction. 

a. Think of a specific situation where you would seek to have them 
completely master at least half of the units, without even trying 
the other half. 

b. Think of another specific situation where you would seek to insure 
partial mastery of all the objectives, although the student would 
not completely master many of them. 

7. Look at table 3. Which test administrative decision
would you support (A, B, C, or D) for these situations? 

a. The course in which you are now enrolled. 

b. A high school algebra class. 

c. The evaluation you intend to do when, or if, you begin teaching. 
In each case, briefly explain the reason for your answer. 



What Are Your Expectations for the Students? 

Another major decision, which will be reflected in the way you build 
your test, design the testing sequence, and report the results, has 




to do with the expectations you have of the students. How funda- 
mentally important is it that all students completely master all 
objectives? Consider this thought problem: 50 units are to be 
covered in a limited amount of time. One student could either 
(a) reach 50 percent mastery on all 50 units, or (b) reach 100 
percent mastery on 25 of the units. Which do you choose? 

Without more details you probably wouldn't be in a position to 
decide. It depends on the relative importance of the information 
he would miss from the last 25 units if he only attempted the first 
25, or the fundamental importance of the 50 percent of information 
he would not get if the half-coverage option were chosen. Actually, 
when you administer some evaluation device to the members of a 
class, your expectation probably falls into one of these four 
categories: 

A. The objectives are so fundamentally important that you must 
insist that all students master them completely. The multiplication 
facts and the recognition of the letters of the alphabet are exam- 
ples. If you are presenting a social studies unit involving the federal 
court system, you may consider it absolutely necessary for future 
learning that each student start with an understanding of the 
structure of the system. In science you might demand complete 
mastery of use of certain measurement equipment as mandatory 
for all. Your expectation, then, is that no range of performance will 
be accepted. All will master the objective. 

B. You expect fairly complete mastery, but not at the 100 per- 
cent level. The entire discussion of minimum acceptance levels from 
chapter 2 is applicable here. You don't have to spell every word 
correctly to get by in our society, nor is it particularly damning to
make an occasional error in arithmetic computation. The students 
may not need to master all of the concepts or learn all of the skills. 
Some minimum acceptance level may be appropriate — say 85 or 90 
percent. Again, however, you expect all to reach this minimum 
acceptance level. The only range of performances will occur between 
that level (say 85 percent) and complete mastery. 

C. You expect a distribution of performances, ranging from some 
who master most of the objectives to some who master few. This 
must be the most common expectation that teachers have, for this 
is the usual result from classroom devices. 

D. You want to maximize the range of performance in the group. 
This could most easily be accomplished by concentrating efforts on 
those who previously were high performers. The performance of
these people could be raised higher still, increasing the range of 
scores in the class. 

Your own personal philosophy of education becomes important 
as you establish these expectations. The relationship between the 
style of teaching to which you subscribe and the expectations you 
have of the students is very close. Consider the graph of figure 3. 



[Figure 3. Relation between expectation of students and distribution of teacher's time. Vertical axis: time that you (the teacher) spend with the student. Horizontal axis: usual performance level of the student. Line D is labeled "Maximize variability and range."]

On the horizontal axis the usual performance level is plotted. 
The performance levels of the students range from low to high, as 
measured by the number of objectives accomplished. Note that the 
term "ability level" is avoided, indicating the author's bias with 
regard to ability tests.⁶ On the vertical axis is the amount of time
you (the teacher) spend with the individual student. The four lines 
indicate a hypothetical distribution of time which you would spend 
with students at various performance levels under the four different 
expectation plans. 

To elaborate further, look at line A, corresponding to the expec-
tation that all students will reach 100 percent mastery on a limited
number of fundamental objectives. These objectives obviously 
could not be pitched at too high a level, or the promise of complete
mastery by 100 percent of the normal students would be unrealistic.
In a sense you, the teacher, would be making a contract with the
parents which says, "I promise that all of the students will at least
learn to do the following things: . . . ." With whom would you
spend the least time, given expectations like these? Clearly the
students who are already performing at a high level would reach
the objectives with little help from you. Who would get most of
your time? The current low performers, of course. Thus, to estab-
lish minimum mastery levels the higher performers would be
slighted in favor of the lower.

6. The topic of ability tests is covered in chapter 7.

Line B is only somewhat different. Now you are not establishing 
100 percent mastery levels for a minimal number of fundamental 
objectives, but have lowered the criterion level to something less 
than 100 percent. Presumably this change would be accompanied 
by an increase in the number and complexity of the objectives that 
you promise to achieve with the students. Since you no longer 
require 100 percent mastery, many of your higher performers will 
already be up to criterion performance, and will need none of your 
time. Once again, your major efforts will be concentrated on the 
low-performing group. 

Line C indicates the "equal effort" scheme of things. Isn't it 
fair to devote an equal amount of time to each student? Is it? That 
depends again on your own personal philosophy. A very clear 
ramification of such a philosophy is that it will lead to a distribu- 
tion of scores in the classroom. It is conceivable that the higher 
performers benefit more, per hour of instruction, than the lower 
ones. Thus, as time passes, the difference between high and low 
performers will be increased. Is that outcome acceptable to you? 

Finally, concern for the higher performers might be a major 
factor. After all, our society does need leaders. Those who have, in 
the past, indicated that they are capable of high levels of perform-
ance should receive as much, if not more, of your time as any
of the others in the room. Should they? Again, this answer is a 
function of your personal philosophy. 

Of course, the issue is not really as simple as shown on the 
graph. Actually you might be able to follow scheme A, or a slight 
modification thereof, and still pay enough attention to the higher 
performers. This could be done without much actual time commit- 
ment on your part, as long as some self-instructional materials or 
individualized projects were available. The higher performers, after 
all, are probably most capable of working alone at their own pace. 
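
The four expectation plans can be sketched as four rules for dividing a fixed amount of teacher time. This is only an illustration: the weighting formulas, the 0-to-1 performance scale, the 0.85 criterion, and the 300-minute total are all assumptions made for the example and are not read off figure 3.

```python
def allocate_time(levels, scheme, total_minutes=300.0):
    """Split total_minutes among students according to an expectation plan.

    levels: usual performance level of each student, 0.0 (low) to 1.0 (high).
    scheme: "A", "B", "C", or "D", matching the four plans in the text.
    The weighting formulas below are illustrative assumptions.
    """
    if scheme == "A":      # complete mastery for all: most time to low performers
        weights = [1.0 - level for level in levels]
    elif scheme == "B":    # mastery to a criterion: those at or above it need no time
        criterion = 0.85
        weights = [max(0.0, criterion - level) for level in levels]
    elif scheme == "C":    # equal effort: the same time for everyone
        weights = [1.0 for _ in levels]
    elif scheme == "D":    # maximize range: most time to high performers
        weights = [level for level in levels]
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    total_weight = sum(weights)
    if total_weight == 0:
        return [0.0 for _ in levels]
    return [total_minutes * w / total_weight for w in weights]


levels = [0.2, 0.5, 0.9]   # three hypothetical students: low, middle, high
for scheme in "ABCD":
    minutes = allocate_time(levels, scheme)
    print(scheme, [round(m) for m in minutes])
```

Running the sketch for a low, a middle, and a high performer shows the pattern described above: plans A and B send most of the time to the low performer, plan C divides it evenly, and plan D reverses the order.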




The type of evaluation system you choose is not some unimpor- 
tant decision which can be made after all the "important" instruc- 
tional decisions are completed. The evaluation system you choose 
reflects the expectations you have for your students, and the rela- 
tionship between the expectations you have and the way you 
distribute your time is fundamentally important. Which expecta- 
tions are you most comfortable with? Your answer might tell you 
some things about yourself. 



Questions 

8. Figure 3 shows four ways the teacher can divide available instruc- 
tional time for the students. For fourth-grade students 

a. Which technique do you most support? Explain. 

b. Which technique do you least support? Explain. 

9. Change the situation of question 8 to focus attention on students 
enrolled in high school algebra. Do your decisions for questions 8a 
and 8b stay the same? Explain. 



Reporting the Results 

The final decision area concerns the system of result reporting 
which you choose to adopt. Your system of reporting is controlled 
by the earlier decisions you make about specificity and complete-
ness of objectives, testing sequence, and expectations. You are
probably most familiar with normative reporting systems. Grades, 
percentile ranks, standard scores, and grade equivalents represent 
commonly known examples of normative reporting systems. The 
identifying characteristic of a normative reporting system is the 
word comparative. An individual's score is interpreted as a com- 
parison to other individuals' scores on the same measure. What 
does it mean to have a percentile rank of 75? It means that, 
compared to the others, you are better than 75 percent. What does 
a 600 on the College Boards mean? It means that, compared to the 
others, you have surpassed about 84 percent. What does a grade of
C mean? It means that, compared to the others, you are about 
average. Chapter 10 will go into more detail on the reporting sys- 
tems mentioned above. 
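
A minimal sketch of the comparative idea follows. The class scores are invented for illustration, and the treatment of the College Board scale assumes a normal distribution with a mean of 500 and a standard deviation of 100. Under that assumption a score of 600 sits one standard deviation above the mean, which corresponds to surpassing roughly 84 percent of the group.

```python
from statistics import NormalDist


def percentile_rank(score, group_scores):
    """Percent of the comparison group scoring strictly below `score`."""
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)


# Hypothetical class of twenty raw scores (invented for illustration).
group = [41, 45, 48, 50, 52, 53, 55, 57, 58, 60,
         61, 63, 64, 66, 68, 70, 73, 75, 79, 84]
print(percentile_rank(70, group))   # 15 of the 20 scores are below 70 -> 75.0

# Standard score: assumes a normal model with mean 500 and SD 100.
college_board_scale = NormalDist(mu=500, sigma=100)
print(round(100 * college_board_scale.cdf(600), 1))   # -> 84.1
```

Notice that neither number says anything about what the student can actually do. Both describe only where the student stands relative to the others.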




The other general type of reporting system is criterion refer- 
enced. With a criterion referenced system you do not report to the 
student's parents in the sense of comparing his performance to 
other students. Instead, you report whether or not the student has 
reached a specified criterion. Sometimes this will involve a single 
concept or objective. Has the student reached 85 percent mastery 
on the content of unit 3 in science, which concentrated on the 
metric system? Has the student reached 100 percent mastery on 
the multiplication tables? The report involves a simple "yes, he 
has" or "no, he hasn't — not yet." 

At other times a variety of objectives will be involved. The re-
port in these situations would consist of a checklist, with the
mastered objectives checked off. The checklist would be a form of 
continuous progress reporting. Again, a more complete elaboration 
on reporting systems will be included in chapter 10. 
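
A criterion-referenced checklist of this kind could be sketched as follows. The objective names echo the examples above, but the item counts, the function name `criterion_report`, and the default criterion of 85 percent are assumptions chosen for illustration.

```python
def criterion_report(results, criteria, default_criterion=0.85):
    """Return {objective: True/False} for whether each criterion is met.

    results: objective -> (items_correct, items_attempted)
    criteria: objective -> required proportion correct (overrides the default)
    """
    report = {}
    for objective, (correct, attempted) in results.items():
        required = criteria.get(objective, default_criterion)
        proportion = correct / attempted if attempted else 0.0
        report[objective] = proportion >= required
    return report


# Hypothetical results for one student.
results = {
    "Unit 3: metric system": (18, 20),
    "Multiplication tables": (98, 100),
}
criteria = {"Multiplication tables": 1.0}   # this objective demands complete mastery

for objective, mastered in criterion_report(results, criteria).items():
    mark = "x" if mastered else " "
    status = "yes" if mastered else "not yet"
    print(f"[{mark}] {objective}: {status}")
```

The report never mentions another student. The 98 percent on the multiplication tables is reported as "not yet" because the criterion for that objective is complete mastery.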



Now, How Do All These Things Fit Together? 

The five general decision areas are briefly summarized in table 4. 
The decisions you make in one column of the table are not inde- 
pendent of the decisions you make in each of the others. For exam- 
ple, you cannot have criterion-referenced reporting if the objectives 
of instruction have not been specified in performance terms. Your 
expectations will be reflected in the way you report the results as 
well as the manner in which you specify objectives. 

In the previous sections, the goal has been to survey the terrain. 
Hopefully you now have a clear idea of the key decisions you must 
make before preparing some sort of evaluation for the classroom. In 
the next chapter a variety of ways of asking questions or obtaining 
information will be described, and chapter 6 is reserved for applying 
the information presented in this chapter and the next to a variety 
of practical situations — the kind that you probably will run into in 
the classroom. 



Evaluation in the Classroom 

Behind the thoughts in the preceding sections is the implied belief 
that most evaluation which occurs in the classroom is carefully 
planned in advance. The implication is misleading. To be sure, 
a large proportion of classroom evaluation should be carefully
planned. But a tremendous amount of informal and necessarily
unplanned evaluation occurs in the classroom too. It cannot be
planned because the evaluative situation cannot be predicted. This
evaluation is usually very individualized and fundamentally non-
quantitative.

[Table 4. Summary of the five general decision areas]

When you lead a discussion, the questions which you ask differ- 
ent individuals are based on your evaluation of each individual's 
ability to respond correctly. Tommy is asked fairly straightforward 
questions, because anything too difficult will tongue-tie him; that's
your evaluation based on past knowledge. Mary raises her hand for 
every question. You only ask her those questions which you're 
fairly certain no one else can handle. You make affective evalua- 
tions. If a discussion begins to get sterile you go to Louis, who can 
always be expected to give a creative or humorous response. A 
question about sewing, homemaking, or children will be directed 
toward Harriet, Louise, or Alice; about sports to any of five differ- 
ent boys; and about science to Laura or James. Each of these 
decisions is an evaluative one. You evaluate implicitly in these
choices, and you communicate your evaluations to the students as you
choose.

Those kinds of evaluations of your students are unavoidable. Be 
aware of this fact, so that you can avoid the specter of the "self- 
fulfilling prophecy." Simply, this notion means the student will do 
what you expect him to do. If you expect him to be average, you 
will ask questions and give assignments which are designed to bring 
average responses. Then when you see the average responses, it 
reinforces your belief that he is average — it's a vicious circle. Like- 
wise, if you always ask Harriet questions about homemaking, or 
James questions about science, they will give homemaking or
science responses, respectively. Your evaluation of their respective 
interests is reinforced — you have prophesied their interests and 
reinforced your own prophecy by acting on it. 

Well-planned evaluations can help you break out of the vicious 
circle with the students. If the measure is specifically tied to per-
formance objectives, Louise can show you that she has other inter-
ests and capabilities than homemaking, and James can indicate
interests and aptitudes beyond science. Learners who had previ- 
ously been pigeonholed in your consciousness as high performers 
may have trouble with the planned evaluation. Those who had 
previously been thought of as slow may simply have been thorough, 
and in the formal evaluation may surprise you with a high degree 
of mastery of the objectives. 




No teacher can avoid making judgments about the students, 
categorizing them as one type or another. This is what a human 
being must do to keep his world from becoming too complex to 
handle. Unhappy results from the student's viewpoint will occur 
when the categorization is inaccurate. As you increase the propor- 
tion of your evaluations which are planned and built around objec- 
tives, the proportion of miscategorizations should decrease. The 
goal is worthy of the effort. 



Question 

10. Think back to your senior year in high school and select the course 
you liked best at that time and the course you liked least. 

a. Categorize each of the two courses in each of the five columns of 
table 4. 

b. How do the categorizations differ? Do you think any differences 
in categorizations may have contributed to your liking one better 
than the other? 

c. Suggest changes in the evaluation system for the course you
liked least, which may have caused you to like the course more. 



Modes of Transportation: 
Questioning Formats in 
the Classroom 



This section of the book is designed to help you, the teacher, under- 
take effective evaluation in the classroom. Using the procedures of 
chapter 4, you should be able to systematically arrive at a pretty 
specific picture of the type of measure you need. To obtain informa-
tion, the students must somehow be allowed to tell or show you the
extent to which they have fulfilled the objectives. How do you 
probe to obtain the information? Do you watch the student or ask 
him? How do you ask him? 

This chapter starts off with two information-gathering tech- 
niques which can be effective, but which have not received the 
proper amount of attention in classroom evaluation. These two are 
performance measures and unobtrusive measures. Neither requires 
the kind of "test" you are most familiar with — the "paper-and- 
pencil" test. Both escape the artificiality of a "paper-and-pencil" 
test. 

After dealing with the use of performance and unobtrusive 
measures in the classroom, the more commonly used classroom 
testing techniques will be covered. These have been categorized 
under the headings of supply and non-supply items. The supply 
type of item requires that the student supply a response from
memory. Essay, short-answer, and completion questions all
require information from memory. Non-supply items have a
limited or restricted range of potential responses. The student 
identifies or recognizes the most correct response rather than re-
calling it. Included under this category are multiple choice, cate-
gory matching, and true-false items.

Other questioning techniques are amenable to classroom use. 
You will frequently be interested also in the student's attitude or 
feelings about some topic. Measurement of such objectives is 
covered in chapter 11. 



Performance Measures 

"I don't care what you know. What can you do?" 

Sort of an anti-intellectual statement, isn't it? Knowledge for its 
own sake is important. From infant to child to adult to the geriatric 
ward, intellectual curiosity is an asset. At some time, just about 
everyone wants to know something for the sake of knowing and 
not necessarily because a practical use is on the horizon. 

Still, there's a broad strain of practicality in our national con- 
sciousness. "Why do I have to learn that?" You've probably said 
that yourself! The reasoning behind such a statement goes some- 
thing like this: I am spending time learning this — there must be a 
reason. What will I be able to do or do better when I have finished 
learning? How will this help me later to do something better? 

The school needs to strike a balance, of course, between con- 
centration on very practical skills on the one hand, and concentra- 
tion on background knowledge which is not immediately practical. 
The background knowledge is designed to prepare the student to 
deal with the unknowns that loom in the future. This line of reason- 
ing leads us to the conclusion that one of the primary purposes of 
our schools will always be to help the student "do" certain things. 
Performance of some act by the student after instruction, which
could not be performed before instruction, will frequently be the 
ultimate goal of your teaching. 

The above statement being true, it seems reasonable that the 
best way to measure such a change is to determine if the student
can actually perform the task at the end of instruction. Not talk 
about the task or describe it; not answer multiple choice questions 
about it; but actually do the task. The measurement of a student's 
ability to do a task, by actually observing him in action, is called a 
performance test. 

All of this seems so reasonable it makes you want to scratch your 
head and say, "Why would we ever give anything but performance 
tests?" The answer is a single word: efficiency. Performance tests
take a lot more time than paper-and-pencil measures. A paper-and-
pencil test can be administered by one person to a group of stu-
dents. The scoring can be done quickly using a simple key or
possibly a machine. It requires no special equipment, space, or 
set-up time. 

A performance test, on the other hand, will almost always be 
administered on an individual basis. Since the student will be asked 
to perform some act, whatever equipment, space, or materials are 
required for the act must be provided for each student. The rating 
scale for the observer of the performance is usually tedious to 
construct and usually requires a pilot study. 

At times though, the performance measure is worth the extra 
time and trouble. Talking about an act is sometimes a lousy substi-
tute for doing it.¹ Situations will occur in your classroom where (a)
you do need to measure the students' behaviors on some important 
objective; (b) where the objective clearly involves the performance 
of some learned skill by the student; and (c) where nothing short 
of actually performing the skill can tell you whether or not the 
students have learned it. 

Take this essay question: Outline the elements of a good golf 
swing. The answer probably won't be very predictive of average 
score for nine holes. No set of multiple-choice questions about using 
a microscope can really tell you if the student can focus this com-
plicated machine on a new specimen. You can't just talk about dance;
you must do it. The same holds with public speaking, hitting a 
baseball, typing a letter, weighing a residue, surveying a field, or 
operating an automobile. Each fits the guidelines given above. 
Namely, a measure is needed, the objective involves a learned skill, 
and no paper-and-pencil test could really catch the spirit of the 
thing. 

Suppose you decide that you have a place where the only appro- 
priate measure is some sort of performance test. How do you set it 
up? 

1. The author once taught a group of ninth graders to be great paper- 
computers of volumes. Given a picture of a sphere, cube, cylinder, or 
rectangular solid, with appropriate dimensions listed, they could compute 
volumes almost without fail. Just for fun, one day I handed them a standard- 
sized can of soup and asked them to tell me how much soup was in it. None 
recognized it as a cylinder. None sought a ruler to measure height and 
diameter. It had seemed reasonable to assume that if they could compute 
volumes from dimensions shown on diagrams, they could also handle real 
objects — and the diagrams were, after all, a lot easier to administer than 
was the performance test. The assumption was wrong, however. They did not
make the transfer from paper to reality. 




1. Start by doing a task analysis. That simply means you break 
the entire performance (as best you can) into its component 
parts. You just have to specify the important elements — actually, 
the task analysis can be pretty crude. 

2. Since performance measures are time consuming, you prob- 
ably shouldn't set out to evaluate every student on the entire act. 
Pick out a few key elements. Assume that if the student can per- 
form certain parts of the sequence satisfactorily, he is capable of 
the whole sequence. Take this objective: The student should be 
able to prepare and identify microscopic slides for the specimens: 
(long list of names). If the student can prepare a few such speci- 
mens — or even a single one — while you watch, it probably means 
he could prepare and identify others. 

3. Arrange the details of testing carefully. Remember, you will
be testing only one at a time. What will the others be doing? If the 
first ones tested are observed by the others, do the later ones have 
an advantage? To avoid this, do the testing in an obscure place in 
the room or behind a screen. Perhaps you won't actually have to be 
at the testing site at all times. For example, to see if the student 
can put knowledge of latitude and longitude into practice, you 
might set up a globe behind a screen. On a written test, you assign 
a latitude and longitude. One at a time, each student goes to the 
globe and tells you what city is at that point. Conversely, you 
could name a city and have each student read the latitude and 
longitude. At other times, you may have to watch each student 
perform the task and keep a record. 

4. Make sure you have developed an objective scoring system for 
the test. If you're listening to students give oral presentations, or
watching them focus a microscope, weigh a specimen, repair an 
engine, or develop a picture, have the scoring sheet prepared in 
advance. Treat each student equally. Give each person precisely 
the same instructions. Score them all the same. 
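
The four steps above can be sketched as a prepared scoring sheet that is applied identically to every student. The task elements and point values below are a hypothetical task analysis for focusing a microscope, and the names `SCORING_SHEET`, `Observation`, and `score` are invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical task analysis: each key element with its point value.
SCORING_SHEET = [
    ("Places slide on stage and secures it", 2),
    ("Starts with the lowest-power objective", 2),
    ("Uses coarse focus before fine focus", 3),
    ("Brings specimen into sharp focus", 3),
]
MAX_SCORE = sum(points for _, points in SCORING_SHEET)


@dataclass
class Observation:
    student: str
    # Indexes of the sheet elements the observer saw performed correctly.
    performed: set = field(default_factory=set)


def score(observation):
    """Apply the same sheet to every student: sum points for elements performed."""
    return sum(points for i, (_, points) in enumerate(SCORING_SHEET)
               if i in observation.performed)


obs = Observation("Student 1", {0, 1, 3})
print(f"{obs.student}: {score(obs)} of {MAX_SCORE}")
```

Because the sheet is fixed before anyone is observed, the observer only records which elements occurred. The arithmetic is the same for every student.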



Questions 

1. Think of two examples from your past when your performance has 
been measured by a legitimate performance item or series of items. 

2. Devise a performance measure which would be appropriate for the 
course in which you are now enrolled. 

3. In question 10 of chapter 4, you were asked to think of the high 
school course which you liked least at that time. Devise a perform-
ance measure which fits some part of that instructional program
which you think might have led to making the course more relevant 
for you. 

4. Since performance measures are time consuming, one frequently 
samples only key elements from an entire task and not every single 
part. To give you some experience in applying this concept, do one 
of the following: 

a. Find a science laboratory guide (from an elementary school sci- 
ence program, or from some high school physics, chemistry, or 
biology lab) and choose one experiment or exercise from it. Do a 
task analysis of the experiment, then devise a series of perform- 
ance measures which sample the student's ability to perform the 
key tasks. 

b. Think of some task from the primary field in which you teach or 
intend to teach. Again, do a task analysis of the steps and sample 
from these steps certain key tasks. As in (a) above, explain pre- 
cisely how you would set up the testing procedure and how you 
would evaluate each student's performance. 



Unobtrusive Measures 

When a person is being evaluated, and knows he is being evaluated, 
the evaluation act itself may alter his "usual" behavior. Reactive 
measures are those wherein the individual responds in a slightly 
abnormal manner simply because of the event of measuring some-
thing about him.² The reaction to the measure is most likely
triggered by measures of attitude or interest, but some measures of 
achievement are also suspect. You've heard of the person who 
"really gets up" for a test or the other extreme, the fellow who 
"chokes." What that really means is that on a day-by-day basis, the 
person would perform at level X, but because of the presence of a 
testing environment, he performs at some level different from X — 
higher or lower. The testing process causes a change in normal per- 
formance level. 

An unobtrusive or nonreactive measure is one in which the
"respondent" doesn't really know he is being measured. Thus, he
cannot react to the measurement. The interaction between the
evaluation process and the individual performance cannot occur.

2. For a complete study of this type of measure, see Eugene J. Webb,
Donald T. Campbell, Richard D. Schwartz, and Lee Sechrest, Unobtrusive
Measures: Nonreactive Research in the Social Sciences (Chicago: Rand
McNally, 1966).

For example, suppose you teach a unit on South American his- 
tory. You want to know if the unit has stimulated interest in or 
changed attitudes about South America. You could question them 
directly with a questionnaire or some oral probes, but the fact that 
you initiate the question may bring on an artificial response. You 
brought the subject up — the student didn't. Some will respond in a 
manner which is designed to please you. Others will purposely try 
to annoy you or burst your bubble. In neither case would the stu-
dent be divulging his real feelings, only the feelings he wants you
to think he feels! 

How could you get a nonreactive measure? You might check the 
library to see if the frequency of South American book checkouts 
has increased. Slip a "where would you rather visit" question into 
English or science class, listing cities all over the world, with a 
couple of key South American names tossed in.³ Do the books on
South America seem to disappear during free reading period? In 
the extreme, you might slip a tape recorder into the washroom to 
see what they talk about when left alone — but that's a little un- 
ethical (not to mention dangerous, since your ego may not be able 
to stand what you hear).
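
One way to tally a measure like the library checkouts is to compare the share of checkouts on the topic before and after the unit. The counts below are invented, and comparing shares rather than raw counts is an assumption made here so that a busier library period does not pass for greater interest.

```python
def share(count, total):
    """Proportion of all checkouts that fall in the category of interest."""
    return count / total if total else 0.0


def change_in_share(before, after):
    """before/after: (category_checkouts, all_checkouts) for periods of equal length."""
    return share(*after) - share(*before)


# Hypothetical library tallies for the four weeks before and after the unit.
before = (6, 240)    # 6 South America titles out of 240 checkouts
after = (21, 250)    # 21 South America titles out of 250 checkouts

delta = change_in_share(before, after)
print(f"before: {share(*before):.1%}, after: {share(*after):.1%}, change: {delta:+.1%}")
```

A rise in the share is suggestive rather than conclusive, since a new book display or another teacher's assignment could produce the same change.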

Which exhibits in a museum are most popular? In the past one 
method used to answer this question involved measuring the fre- 
quency of floor tile replacement around the exhibits. Where people 
walk most frequently, the tiles wear out faster. Thus, an unobtru- 
sive measure of popularity is available. Which science topics are 
most popular with fifth graders? See which pages in the encyclo- 
pedia of science have the most dirt, thumbprints, or overturned 
corners. Is the new fine arts program having any effect? See if more 
students go to available performances. See what radio station is 
turned on most often. 

An old sports story concerns a young man who might have been 
the greatest baseball pitcher of all time. His fast ball could go 
through a brick wall. His curve broke at least sixty degrees. His 
control was excellent and he was strong as an ox. His only failing 
was a critical one. He couldn't pitch when anyone was watching. 
With spectators, his value was zero. 



3. If you plan in advance, you might administer a modified version of the 
"where would you rather visit" question before your South American unit 
also, to detect change. 




The moral is fairly clear: Just because a "reaction" occurs when 
the person knows he is being measured, it does not imply that such 
measures should always be avoided. All people must have the expe- 
rience of performance under pressure — the pressure that comes with 
knowing that you are being tested. Most students get used to the 
pressure so that very little interaction actually occurs, probably far 
less than the student really thinks is occurring. Thus, the two 
general cases where a reaction may occur but where you can right- 
fully ignore it are (a) where the reaction is customary and small, 
and (b) where the reaction may not be small, but where the inter-
action between measure and performance is more like reality than
a situation where the respondent does not know the measure is 
occurring. 

The sections on performance measures and unobtrusive observa- 
tions have purposely been placed first in this chapter. By this time, 
perhaps you have begun to develop the philosophy that not all 
evaluations in the classroom must be based on objective, paper- 
and-pencil tests. The key thing to remember is that you must 
establish the objective first. What changes do you wish to see in the 
student? The manner which is used to evaluate degree of attain- 
ment of these objectives follows as the second step. Don't limit 
your measurement techniques to a few common types. Make the 
measure fit the situation. 



Questions 

5. An eighth-grade teacher presents a four-week unit on the police. 
Besides reading assignments, the students have visits from repre- 
sentatives of the police department, lawyers, and local groups which 
have charged police brutality. The teacher wishes to find out if the 
attitude of the students toward the police has changed. Devise two 
unobtrusive measures the teacher might use. Make them reasonable 
— measures a real teacher might actually obtain. (Remember, un- 
obtrusive measures can be "staged.") 

6. For a prospective elementary teacher to find satisfaction in teaching, 
a positive attitude toward young children is probably mandatory. 
Devise at least one unobtrusive measure which might be used by a 
college with prospective enrollees which would give information 
regarding the attitude of these people toward young children. 




General Considerations in Achievement Testing 

The general goal of this chapter, as you will recall, is to provide a 
useable guide to the construction of measurement devices for the 
classroom. A few general considerations permeate all types of 
achievement tests. These are suggestions to keep in the back of 
your mind regardless of the specific item types you decide to 
include. 

Before you begin any writing activities, consider these: 

A. Many teachers have the mistaken impression that objective 
items (multiple choice, matching, true-false, to name the most 
common types) can only be applied to measuring knowledge-based 
objectives. Unfortunately, this is too often the case in practice. 
Objective items can be used to measure higher level learning. Com- 
prehension can be sampled by presenting a stimulus (stem) and 
asking the student a question which requires that he interpret the 
passage. Applications of learned principles require the presentation 
of a novel situation which had not been previously introduced in 
class discussions. Discriminative thinking can be sampled by pre- 
senting plausible responses to a question where one answer is the 
best — a situation well known in real life where one often must 
choose the best of several competing and acceptable alternatives. 
Set aside a few neurones in the permanent storage area of your 
brain to trigger these two bits of philosophy: 

1. Objective test items in a multiple choice, matching, or true- 
false framework can sample cognitive levels above routine memory 
of information. You do not have to go to an essay item for these 
kinds of measures. 

2. At the same time, don't be apologetic if you decide it is 
necessary to evaluate the student's ability to recall or recognize
factual information. Before a student can think critically, he needs 
something to think critically about! The error comes in only evalu- 
ating the recall or recognition of factual material. 

B. Decide in advance the number of items or points you want in 
the test. No rough rule of thumb can be provided linking number of 
items to time required or available, since this relationship depends 
on the complexity of the item and the reading ability of the stu- 
dents in the class. At the college or high school level, one can 
usually figure that about one minute per multiple-choice item will
allow 95 percent of the students to finish in ample time. But you 
are the best judge of your own students. 

C. Distribute the coverage of the items in advance, if you can. 
You might use the technique outlined in the last chapter, allowing 
a certain number of pages per item. If some sections are more 
important, in your estimation, you probably will want extra empha- 
sis at these points. While it is possible to simply weight those items 
more heavily (count each as two or three points) a better technique 
is to write more items from those areas which you feel are more 
important. 

D. Have a single correct answer for most items. Students work 
through a lot of multiple-choice tests as they make their way from 
elementary school to graduation. If you occasionally slip in a
"choose the two most important reasons from the list . . ." type of
direction, you will surely catch a few who respond to just one out of 
force of habit. Remember that ultimately you are sampling learn- 
ing, so don't let the test format get in the way of the student's 
telling you what he knows. If you want to measure his ability to 
follow changes in directions, write a special test for this skill — don't 
sneak it into a test designed for another purpose. 

Sometimes, though, a double or triple answer question is the only 
appropriate kind. If you want to include some of these, be sure to 
underline the directions or put them in capitals. Congregate items 
of this type together in the test. A minor scoring problem develops 
with double-answer questions. With a multiple-choice item one 
point is allowed; the item is either right or wrong. If you ask for 
two answers selected from a list of possible responses, you must 
decide how to handle wrong responses. Here's an example showing 
the difficulty: 

Select the two most correct responses: 





___ A. . . .
_X_ B. . . .
_X_ C. . . .
___ D. . . .
___ E. . . .



The correct answers are B and E, but the student marks B and C. 
Has the student earned one point for choosing B? Or zero for 
choosing one correct and one incorrect? How about a student who
marks three answers? One way to handle an item like this is to
score it as five points. The student receives one point each for
leaving A, C, and D unmarked, and one point each for marking B and E.
In essence, the item is scored as if it were five true-false items
all wrapped up in one. Under this scheme, the student who marked
B and C earns three of the five points.

E. If any possibility exists that the test, or some parts of the
test, will some day be machine scored, limit the number of possible 
responses for each item to five. 

F. Make some sort of plan, if possible, for a tryout, even a very 
informal one. Another teacher would be a good tryout person. Two 
older students would be satisfactory. The purpose of this step is to 
seek out the more obvious bad questions — statements which are 
ambiguous or just plain difficult to understand. 
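
The five-point scoring idea described under suggestion D can be sketched in a few lines of Python. This is only an illustration: the function name and its arguments are made up for the example, and the rule it applies (one point for every option handled correctly, whether by marking it or by leaving it blank) is one reading of the scheme described above.

```python
def score_multiple_response(options, key, marked):
    """Score a multiple-response item as a bundle of true-false decisions.

    options: every option label on the item, e.g. "ABCDE"
    key:     the labels that should be marked
    marked:  the labels the student actually marked
    One point is earned for each option handled correctly: marked when it
    belongs to the key, left blank when it does not.
    """
    key, marked = set(key), set(marked)
    return sum((opt in key) == (opt in marked) for opt in options)


# The example from the text: the key is B and E, the student marks B and C.
print(score_multiple_response("ABCDE", "BE", "BC"))   # 3
# A student who marks three answers (B, C, and E) still gets a definite score.
print(score_multiple_response("ABCDE", "BE", "BCE"))  # 4
```

Scoring this way settles the "three answers" question automatically, since every extra mark simply costs the point for the option that should have been left blank.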

The following are a few suggestions to keep in mind regarding 
the way you assemble the test: 

G. Keep your item types together. A single test could well have 
a variety of item types chosen from true-false, multiple choice, 
short answer essay, matching, and some others. Group them by 
type. If you mix them up, you might mix the student up also. Such 
outcomes would break one cardinal rule of testing: Don't let the 
item format get in the way of the student's telling you what he 
knows. 

H. Within an item type, group together questions from a given 
content area. For example, if you are going to write five questions 
in a multiple-choice format on the topic of church-state separation, 
group the five questions together. Don't force the test-taker to 
spend time searching back through all the other questions for 
earlier responses. 

I. Within an item type (and without breaking the rule stated in 
(H) above) arrange the items from easy to difficult. 

J. Don't change pages in the middle of an item. If possible, don't 
even change columns in the middle of an item. 

K. Keep the responses to multiple-choice or matching items in a 
column. Don't spread them out horizontally across the page. 

And one final thought to help you continually improve your 
classroom testing program: 

Write each separate item on an index card. Record there on the 
card the correct answer, a catalogue of the times you have used the

item and any comments about it, such as its apparent difficulty. 
Eventually, when you develop a backlog of item cards, you will be 
able to do individualized testing, or to change a small proportion of 
items from test to test to discourage students from "learning for 
the test." 
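
If you would rather keep your item cards in a computer file than in a shoebox, the same record can be sketched as a small Python structure. Everything named here (ItemCard, record_use, never_used) is hypothetical and chosen only for illustration, and the comment recorded on the first card is invented sample data.

```python
from dataclasses import dataclass, field


@dataclass
class ItemCard:
    """One test item, kept the way an index card would keep it."""
    text: str
    answer: str
    uses: list = field(default_factory=list)      # when the item was used
    comments: list = field(default_factory=list)  # e.g. apparent difficulty

    def record_use(self, occasion, comment=None):
        """Note one more use of the item, with an optional remark."""
        self.uses.append(occasion)
        if comment is not None:
            self.comments.append(comment)


def never_used(cards):
    """Return the cards that have not yet appeared on any test."""
    return [card for card in cards if not card.uses]


cards = [
    ItemCard("In testing, what is the name for the measure of a test's "
             "precision?", "reliability"),
    ItemCard("What is the Taxonomy of Educational Objectives?",
             "(model answer goes here)"),
]
cards[0].record_use("Unit 1 test", "seemed easy")  # invented sample entry
print([card.text for card in never_used(cards)])
```

A list of never-used cards is one simple way to find fresh items to swap into the next version of a test.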



Question 

7. Find two tests which have been administered to you or a friend of 
yours in a college course. Have any of the points listed under
"General Considerations in Achievement Testing" been broken? 
Which? How would you correct each of them? 



Supply-Type Questions 



Essay-Type 



You have undoubtedly been asked to respond to essay-type ques- 
tions. Here are two examples: 

Mention three different types of reliability coefficients, indicating 
how each is distinctively different from the other two. 

Trace changes in the attitude of Americans toward government aid 
for the poor and needy. Begin in the early 1700s and continue 
through the current time. Mention specific historical events where 
these are appropriate. 

Classroom teachers like essay tests. The use of essay items has a 
pervasiveness which is unshakeable even in the face of serious 
objections by testing experts. If you do teach, and if the students 
in your room are advanced enough to write, the probability is that 
you also will write essay items. Well-written essay items can be an 
excellent measurement device. How does one earn this "well- 
written" designation? 

A. The strength of the essay question draws from the possibility 
of higher level questioning. Use words like "contrast," "compare," 
"apply the principles of ... to the following," and "give new 
examples of. . . ." Try to go beyond what you or the text have 
covered previously. Look to the present. How should the student

now be able to use this information? Look to the future. How will 
the student be using this information in the units soon to be 
covered? 

B. Where possible, a series of short-answer essay questions
is better than one more general question. The suggestion is
especially true if your interest is in the application of learning to 
specific situations, rather than recall of the situations followed by 
the applications. For example, consider this general question: 

Describe how different kinds of pollution adversely affect our en- 
vironment. Give examples. 

Unless you particularly want the student to recall the types of
pollutants (in which case an objective-type recall question is
dictated) you could reword the question into more specific ones like
the following:

Describe the process whereby thermal pollution causes adverse 
effects in a lake. 

Contrast the effects of carbon monoxide, sulfur dioxide, and soot on 
man and his environment. 

Describe the process whereby DDT causes softness in eagle eggs. 

C. Frame the question so that there is a correct response; that 
is, a response that people knowledgeable in the field would agree is 
correct. The suggestion means you will need to avoid starting questions
with "In your opinion" or "What do you think about . . ." and
similar lead-ins which cannot be challenged for accuracy. After all, if
you ask for someone's opinion you can later disagree, but you can- 
not evaluate the answer as wrong. Only the responder knows his 
real opinion. In certain situations, an opinion or a personal interpre- 
tation may be the appropriate thing. In these situations, you must 
then evaluate the response on the basis of clarity of thought or 
expression or the logic of the argument, rather than on the basis of 
"correctness." 

D. If the items are to be used with students from varying per- 
formance levels, try to set the items up so all can make at least 
some response. 

E. In general, don't allow a choice of questions. Often teachers 
include directions like "Answer three out of the five" or "Answer

any four questions from this list." Such a procedure grows from the 
belief that this is a way of being fair to the student. Really, though, 
the student is forced to make a difficult decision before he begins 
writing. Very quickly at the beginning of the test, the student must 
decide which items he can respond to best. How often have you 
started answering a question only to find, about half way through 
your answer, you really don't know what you are talking about? 
Better for you, the teacher, to decide in advance which questions 
you consider important for each pupil, and ask those questions. Of 
course, this suggestion does not negate the suggestion from chapter 
4 that you might assign different questions to different individuals, 
as a function of the individuals' previous experiences and perform- 
ance levels. 

F. A suggestion which should never be skipped: Write the ex- 
pected response to the question as you write the question. Don't 
wait for the students' responses to come in before you develop your 
own required answer. That's not fair! If you intend to put them on 
the spot for a definitive answer, you should at least be able to give 
the answer yourself. Also, by answering the item yourself, you may 
be able to pinpoint ambiguities in the way you've worded it, or you 
might find out that the answer is far too elaborate and lengthy for 
the time allowed. 

G. Developing model answers as you write the questions will also 
allow you to develop point values in advance. If you are going to 
assign different weights to different items, it's only fair that you tell 
the student of this in advance so he can concentrate his efforts 
where the payoff possibility is greatest. 

All of the preceding suggestions for framing the question are 
quite practical. Following them should not be too difficult. In 
scoring an essay test the precision tends to break down. The 
reliability of your test is a direct function of the precision of your 
scoring procedures. The scoring task is where the test experts 
challenge the use of essay tests. How can you make your scoring 
techniques more precise? 

First, follow the earlier suggestion of a model answer with asso- 
ciated point counts, written before the test is administered. Here 
is an example, taken from a sixth-grade science test (note 4):

4. John G. Navarra, Joseph Zafforoni, and John W. Wick, Achievement 
Tests for The Young Scientist: His Problems and Methods (New York: 
Harper and Row, 1971), p. 31. 




10. A lake is clean. Suddenly, a lot of phosphates are dumped in 
the lake. A few years later, all the fish are dead. Describe the steps 
(each process) involved, from the dumping of the phosphates to
the death of the fish. 

Model Answer: (1 point) Mentions phosphates food for algae. 
(1 point) Thus more algae grow. (1 point) Bacteria grow on dead 
algae. (1 point) Thus increased algae lead to increased dead algae, 
which leads to increased bacteria. (1 point) The bacteria use the 
oxygen in the water so increased bacteria cause increased oxygen 
use. (1 point) Thus less oxygen available to fish and they die. (6 
points total) 

Usually, you should be able to go through your model answers 
and assign points in this manner. To be sure, sometimes a clever 
student will answer in a manner which causes you to change your 
point distribution when you're grading the test. Don't be afraid to 
change if you feel it's necessary. Sometimes those students make us 
teachers feel stupid! But most of the time you will be able to gener- 
ate an explicit model answer with point values assigned. If you can 
do this task in advance, scoring essay questions can become a fairly 
precise operation. Besides, it's the only fair way. 
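
For readers who like to see the bookkeeping spelled out, the model answer above can be written down as a simple checklist in Python. This is a sketch, not a required procedure: the names MODEL_ANSWER and score_essay are invented for the example, the wording of each part is paraphrased from the model answer, and the judgment of whether a student's paper actually contains a given part remains yours.

```python
# Each entry pairs one part of the model answer with its point value.
MODEL_ANSWER = [
    ("Phosphates are food for algae", 1),
    ("More algae grow", 1),
    ("Bacteria grow on dead algae", 1),
    ("More algae lead to more dead algae and more bacteria", 1),
    ("Bacteria use up the oxygen in the water", 1),
    ("Less oxygen is available to the fish, and they die", 1),
]


def score_essay(model_answer, credited):
    """Return (points earned, points possible).

    credited: positions (0, 1, 2, ...) of the model-answer parts that the
    grader judged to be present in the student's response.
    """
    earned = sum(points for position, (_, points) in enumerate(model_answer)
                 if position in credited)
    possible = sum(points for _, points in model_answer)
    return earned, possible


# A paper that mentions the first two parts and the last two parts.
print(score_essay(MODEL_ANSWER, {0, 1, 4, 5}))  # (4, 6)
```

If a clever student persuades you to change the point distribution, you change the list once and rescore every paper the same way.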



Questions 

8. Write a general essay question regarding some key concept in chap- 
ter 3 of this book. Be sure to write a model answer as well, and 
suggested point counts for the various parts. 

9. Now break down your question from (8) into smaller, short-answer 
essay questions. Again, be sure to construct model answers and point 
counts for each answer. Which way do you think the item should be 
administered — as one longer question or a few shorter ones? 



At times, though, the type of question you're asking makes the
concept of a model answer difficult to follow. When the characteristics
of the question stimulate a variety of individualized responses,
no single model answer will suffice. Themes generally fall
into this class, even when a fairly specific topic is assigned; so do
questions designed to bring out creative applications of learned
concepts or processes. When a model answer with associated point
totals is inappropriate, use a rating technique. The rating technique
has these steps:

1. Score only one item at a time. That is, read all responses to 
the first item before you go on to the second. 

2. Don't just start reading "cold," allowing general impressions 
to govern your rankings. You wrote the question and assigned the 
task. Why? Were you seeking clarity of expression? An argument
with a logical defense? A clever or unique response to a question? 
Vivid description and expression? You may not have the same 
purpose for each of the items, which is perfectly all right. 

3. Decide the number of gradations you think you can make. 
Do you think you'll be capable of sorting the responses in three 
distinct groups? in five? Don't make too many or you'll force your- 
self to become arbitrary. Probably the most defensible technique 
for most classrooms would be one of these two: 

a. Three levels (high, average, low) with equal numbers at each 
level. 

b. Three levels of equal numbers, followed by the possibility of 
raising a few from the high group to a "Superior" rating. Likewise, 
a few from the low group would be dropped to a "Poor" rating. 
Roughly speaking, you might want a distribution something like 
this: 

5% Superior 

30% High 

30% Average 

30% Low 

5% Poor 

Don't force these last two groups, though. If no papers really seem 
to fit, have one or both of these groups empty. 

4. After your initial placement of all the papers, glance quickly 
through each stack to see if any paper is misplaced in a very glaring 
manner. 

5. Write the score for that item on each paper and go on to the 
next item. 

Seems like a lot of work? It is indeed. But such a set of proce- 
dures will give you the fairest possible evaluations. If you aren't 
willing to devote the time to a fair and reliable reading of the items,
don't ask them in the first place.
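
To get a feel for what the rough 5-30-30-30-5 distribution means for a class of a given size, the arithmetic can be sketched in Python. The function name and the rounding rule are assumptions made for this example (the two extreme groups are rounded down, and any leftover papers go to Average first, then High); the text above asks only for a rough distribution, and the Superior and Poor counts are ceilings, not quotas to be filled.

```python
def rating_targets(n_papers):
    """Rough target counts for a five-level rating of n_papers responses."""
    extreme = n_papers * 5 // 100           # about 5% each, rounded down
    middle_total = n_papers - 2 * extreme   # papers left for the three levels
    base, extra = divmod(middle_total, 3)   # three roughly equal groups
    return {
        "Superior": extreme,
        "High": base + (1 if extra == 2 else 0),
        "Average": base + (1 if extra >= 1 else 0),
        "Low": base,
        "Poor": extreme,
    }


print(rating_targets(40))
# {'Superior': 2, 'High': 12, 'Average': 12, 'Low': 12, 'Poor': 2}
print(rating_targets(30))
# {'Superior': 1, 'High': 9, 'Average': 10, 'Low': 9, 'Poor': 1}
```

With thirty papers, then, you would expect at most one Superior and one Poor rating, and none at all if no paper really seems to fit.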




Question 

10. Reread chapter 4. Devise the following questions: 

a. An essay question with a scoring procedure which assigns points 
to different parts of the answer. 

b. An essay question which must be evaluated using a rating 
method. Briefly state the criteria you would use to rate the 
responses. 



Other Supply-Type Questions 

Short Answer Completion. These items are somewhat like essay 
questions except they would concentrate on a single thought or 
concept. Essay answers usually go over fifty words. Short answer 
completions should be one sentence or even just a phrase. Some 
examples: 

When soap is used with hard water, a scum forms. What causes 
this scum? 

What happens to limewater when carbon dioxide is brought near it?

What is the Taxonomy of Educational Objectives? 

Very briefly distinguish between intrinsic and extrinsic motiva- 
tion. 

What classical concept of tragedy is found in the characterizations
of Lear, Othello, and Macbeth?

Computation Questions. Items of this type occur most often in 
math and science, but occasionally are found in other areas as well. 
Some suggested rules and regulations: 

1. If the question requires units (feet, tons, dollars), specify 
these clearly. 

2. Where shall the student do the computations? If you want the 
work shown on the test, leave adequate room. If you want it done 
elsewhere, make sure everyone has some paper to work on. 

3. Unless absolutely necessary, avoid putting computational
problems in a multiple-choice format. A multiple-choice format,
with one correct and three incorrect possibilities, is a pretty hollow
substitute for real-life computational problems where one correct
answer exists amid an infinite number of incorrect ones.

Completion Questions. Completion items usually focus on one, two, 
or three word explicit responses. If you decide that your students 
should have a specific word, name, or date immediately available in 
memory (without the type of cues provided in a multiple-choice or 
matching exercise), then a completion item should be used.
Although these items seem so easy to write, a number of general rules
should be followed to ensure that the items do get at meaningful
information: 

1. Wherever possible, ask a question rather than leave a blank 
in the sentence. For example, the question: 

In testing, what is the name for the measure of a test's precision? 

(reliability) 

is preferred to 

__________ is the name for the measure of a test's precision.



After all, you're asking for information and you might as well come 
right out and ask the question. 

2. If you provide blank spaces for the student's responses, make 
the length of all blanks the same. 

3. Don't just copy a sentence from the text, leaving a key word 
out. Such questions encourage completely superficial study habits. 

4. If an item cannot be stated in a question format, and you feel 
you must leave a word out for completion, at least have the blank 
near the end of the statement. 



Non-Supply Type Questions 

Performance, essay, short answer completion, and computational 
problems are called supply-type items. They have a common de- 
nominator. In each, the student is asked to supply something — a 
bit of information, a thought, a skill — in the answer. The responses 
are not restricted to a choice among a few possibilities. If the

question is on a topic completely unknown to the student, guessing 
will be a little tricky. 

Most of the items appearing in commercially published standard- 
ized tests are not supply-type. True-false, matching, and multiple- 
choice items have a restricted number of response possibilities. 
Rather than recall a skill, some information, or a thought, as the 
student must do with a supply-type item, the task is to recognize 
or select the most correct answer from a limited number of possibilities.
The student is required to discriminate from among a list
of plausible responses the one which is most correct. In a true-false
item, the number of possible responses is limited to two; the multi- 
ple-choice item may contain up to five choices, while a matching 
item can be constructed with up to ten different possible responses. 
In each case, the response domain is restricted. The skill required 
is recognition, not recall. Don't get the mistaken impression, 
though, that restricted response items are only applicable to lower- 
level cognitive objectives. 

Matching Items 

The matching item format is a popular choice in teacher-made 
objective tests. Its format looks something like this: 

DIRECTIONS: Choose the contribution from the right-hand column
which matches the name listed on the left side. Use a contribution
from the right only one time.

____ J. J. Thomson               a. Work with electrolysis.
____ Ernest Rutherford           b. Discovery of the electron.
____ Michael Faraday             c. Experiments with alpha
____ Svante August Arrhenius     d. Work with radioactivity.
                                 e. Developed idea of energy levels.
                                 f. Theory of ionization.

For a matching item to be appropriate, two conditions should be 
met. First, all of the questions in the exercise should center on a 
single concept. In the item above, the concept involves matching

the scientific contribution with the man. The concept is further 
limited to men who have made major discoveries about the atom. 
Second, your objective should involve words such as "recognize" or 
"discriminate." The matching format is similar to tasks which are 
common in our daily life. A person is not always asked to recall a 
series of responses. Instead, the task is to discriminate among re- 
sponses which are available. The distinctions between competing 
responses by the student can be very subtle. Properly written 
matching exercises can demand a high level of discriminative think- 
ing. Here are some hints on writing matching exercises: 

1. Don't mix ideas. Stick to a single concept in the matching 
exercise. 

2. Do have more responses than needed. If you have a strict 
one-to-one relationship between question and response, the last 
couple of answers become a matter of elimination. Some item 
writers get around this problem by allowing a response to be used 
more than once. Such a format can be confusing because it is 
unfamiliar to the student. If your objective requires that a response 
should be used more than once, look into the "category matching" 
type of items described in the next section. 

3. Do arrange both columns (stimulus and response) in some 
sort of systematic order — such as alphabetically. 

4. Do indicate in the directions if a choice can be used more than 
once. 

5. Don't have more than six to ten questions. Stick with the 
lower end of the range if the questions are fairly complex and move 
up to ten only if the individual questions are short. 

6. Do list in the directions the basis upon which the matches are 
to be made. 

Category Matching 

The student's world is complex and, if any order is to be shaped 
from masses of random data, systems of categorization are neces- 
sary in each content area. People from botanists to political sci- 
entists to newspaper reporters all do extensive categorization. A 
modified matching item is appropriate in cases where you wish to 
assess the student's ability to categorize. Here is an example: 

DIRECTIONS: Fill in the name of the correct poetic device for
each example.

Alliteration     Assonance     Metaphor     Onomatopoeia

__________ the ape's old man face
__________ the clip clop of horse's hooves
__________ his glad facade melted past sad to mad
__________ they slept sweetly in sultry silence
__________ her eyes were pools of sorrow
__________ a bald, mauve surface reflected
__________ the hissing kettle called



Most of the rules listed for regular matching items still apply. Do 
make the entire exercise center on a single concept. Don't have a 
one-to-one correspondence — have at least one of the categories used 
more than once. Do arrange the categories alphabetically or in 
some other systematic manner. Do list in your directions the basis 
for the matching as well as making it clear that a category designa- 
tion can be used more than once. An additional caution is needed: 
Make sure that the distinction between your items is clear. Are the 
categories really distinct? You must have clearly defined for your- 
self what the distinguishing features of the individual categories 
really are. 



Question 

11. Here is the beginning of a category matching question: 

DIRECTIONS: For each statement below, choose the most 
appropriate category. Put the category's letter in the space in 
front of the statement. 

CATEGORIES: A. The first statement is the cause for the second.
            B. There is no relationship between the two statements.
            C. The second statement causes the first.

_C_ 1. The price of chickens in 1972 was up. (first statement)
       The chicken crop in 1972 was very bad. (second statement)

Now you write four other pairs of statements to fit this format. Use any 
topic, but try to use each of the categories at least once. 




True-False Items 

Very few things are absolutely true or totally false in the world of 
compromise where your students will spend their lives. Nonetheless, 
on hearing or reading a particular statement, each person is often 
required to decide "that is generally true" or "that is generally 
false." The goals of your instruction will frequently include situa- 
tions where the student is expected to decide if a statement is 
generally true or false. For such objectives, true-false items are 
appropriate. 

True-false items come in for their share of criticism by measure- 
ment experts. Such criticisms usually mention the lack of discrimi- 
nation of the items (when contrasted with other forms, such as 
multiple-choice items) and imprecision or sloppiness in the con- 
struction of the true-false statement. The goals of this chapter 
include outlining the available item types, specifying when each 
is most appropriate, and helping you develop skills in writing 
acceptable items of each type. This long, technical elaboration of 
the measurement specialists' views on various item types will be 
avoided here. For a careful discussion of the measurement view of 
true-false items, the book by Ebel (1972) is suggested (note 5).

Where would true-false items be a good choice? One response is 
suggested above. At all age levels and across subject matter areas, 
certain objectives will center on the student's ability to evaluate 
the accuracy of a statement. In such cases, in school as in life, the 
student will not be allowed to hedge or equivocate. Even when the 
statement is only generally true or generally not true (in either 
case, not absolutely so) the student will be required to put all his 
chips on one of the two squares. If a multiple-choice question were 
used, it would have to include two or three other responses which 
might detract from the primary objective. 

Closely associated with the above motivation for true-false 
items are those cases where the intent is to write a multiple-choice 
question, but only one correct response is really plausible. For 
example: 

A person threatens to injure another person. What is the name of 
this crime? 

____ assault
____ battery
____ kidnapping
____ rape



5. Robert L. Ebel, Essentials of Educational Measurement (Englewood 
Cliffs, N.J.: Prentice-Hall, 1972), pp. 156-71. 




The correct response is "assault." The only really plausible dis- 
tractor is "battery." When this item was administered to approxi- 
mately 200 students, the last two responses were never chosen. 
Even if the student does not know the technical meaning of "bat- 
tery" he probably will understand that "threatens" is not the same 
as "battery." Thus, a student who has no real knowledge of the 
meaning of "assault" will score this one correct. On the other hand, 
if the goal is to determine the student's knowledge of "assault," the 
item could be stated: 







(T)  F   The crime of assault occurs when one person threatens to
         injure another.

Such a statement would be superior to:

 T  (F)  The crime of battery occurs when one person threatens to
         injure another.



The second statement has the same problem as the multiple- 
choice item. That is, the student still would not really need to know 
the meaning of assault to score the item correctly; and after all, to 
determine if that knowledge was present was the original objective. 

These are the two general situations where true-false items are 
more appropriate than other types of items. Of course, a well-written
true-false item can be as effective as other item
types with other objectives. The particular item type you choose
is frequently a matter of personal preference. Whenever true-false 
items are your choice, consider these suggestions: 

1. Do limit the statement to a single thought or single concept. 
If you include two thoughts, you may put the student in conflict — 
he may think the first part is true, but the second is false. Consider 
this statement: 

T F In arresting a suspect, the police must apprise him of his 
rights, but they can search for a weapon without a search 
warrant. 

The entire statement is true. If a student responds "F" the teacher 
will not know which part needs further instruction. Does the stu- 
dent think both parts are false or just one of them? Better to use 
the question as two true-false items — preferably where one re- 
sponse is true, and the other is false. For example, the item could 
be reworded as follows: 




T F In arresting a suspect, the police must apprise him of his 
legal rights. 

T F Before the police can "frisk" a suspect for a weapon, they 
must obtain a search warrant. 

Here is another example: 

T F Those defined as "poor" in this country are more likely to 
be nonwhite than white and more likely to live in an urban 
setting. 

The first half of the statement is false (in terms of numbers, 
there are more white than nonwhite poor in this country) but the 
second part is true. Since one part of the statement is false, the 
most correct response is "F." But this isn't really fair. 

By the way, each part of this example illustrates a case where a 
true-false item is quite appropriate. The actual percentages of 
white and nonwhite poor and for the urban and rural poor may not 
be as important here as the knowledge of the directionality of the 
majority. 

2. Try to avoid using negatives. Consider these items: 

T F The aneroid barometer does not have a liquid (like mer- 
cury) in it. 

T F The aneroid barometer has a liquid in it. 

Now a student who knows that the aneroid barometer has no 
liquid and that the mercury barometer does may still choose the 
wrong response due to the confusion factor in the first item. Such 
an event breaks a cardinal rule of testing: Don't let the format get 
in the way of the student's showing you what he knows. Some 
students are careless readers and will simply miss the word "not." 
But there is a second, and more subtle, problem with the first item. 
The "not" in the statement actually makes the item true. With 
the second statement, the student reads the item and, assuming he 
knows the aneroid barometer requires no liquid, moves directly to 
the "F." With the first statement, however, the student must make 
a positive response (true) to the negative in the sentence (not). If 
the goal is simply to find out about the student's knowledge of 
aneroid barometers, why not ask the question in the most direct 
manner? The "double shift" is really a trick. The second item tests 
only the specific bit of knowledge, while the first adds a component 
which might be called discriminative reading ability or just plain 
cognitive power. In the technical sense, the more complicated item 
will probably discriminate better among the high and low per- 
formers, but discrimination between these two groups is usually not 
the goal of the classroom test. Leave the cognitive power segment 
for the aptitude (IQ) tests which will inevitably be administered to 
your students somewhere along the line. 



Question 

12. Take the true-false statement used as a previous example and 
rewrite it as two statements — one true and one false. 

T F Those defined as "poor" in this country are more likely 
to be nonwhite than white and more likely to live in an 
urban setting. 



3. The goal in constructing false statements is to make the 
statement sound quite plausible to a person with an incomplete or 
superficial understanding of the concept at hand. This kind of 
learner will recognize certain key words or phrases but will not 
comprehend specifically how these all fit together. To the super- 
ficial learner, false statements should sound true. 

4. The goal in constructing true statements is to make them 
sound just like the false statements so that the student who has 
not internalized the relationships and concepts will be unable to 
discriminate between the two kinds of statements. The assumption 
being made here is that the classroom test is designed to assess the 
results of some learning task — for example, a unit, a lecture, a 
chapter, or an experiment. Following directly from the assumption, 
then, you would not want to design your questions such that a 
clever student can pick out the true and false items without ever 
really participating in the learning task. As much as you can, select 
topics for the true-false items which are relatively unique to the 
specific learning task. Otherwise, as mentioned above in the discus- 
sion of negatively worded items, you will be testing previous learn- 
ing level, cognitive power, and/or reading discrimination — and not 
the objectives of a particular learning task. 

5. Keep the true and false statements approximately equal in 
length. In general, teacher-constructed false items can tend to be 
shorter than true statements. Don't allow this to happen. 

6. Have about an equal number of true and false items. 

7. Avoid simply copying sentences from the text. Too many 
teachers write a quick test by copying key sentences — occasionally 
changing a word to make half of them false. Don't fall into this 
trap. In the first place, you'd be surprised at the minimal number 
of sentences on a typical text page which can be used as true-false 
statements. The second reason is much more important. Such a 
construction practice encourages the wrong kind of studying and 
learning. If the students become accustomed to seeing sentences 
copied directly from the book in the test, they will soon get in the 
habit of putting key statements to rote memory and will not focus 
on the overall ideas, relationships, or applications. 
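Suggestions 5 and 6 are easy to check once the items are written. The following is only a minimal sketch, assuming you keep your items as pairs of statement and answer; the function name `summarize_true_false` and the dictionary keys are hypothetical, and the sample statements are taken from examples in this chapter.

```python
from statistics import mean


def summarize_true_false(items):
    """Count true and false items and compare their average length in words.

    items: list of (statement, key) pairs, where key is True or False.
    """
    true_lengths = [len(text.split()) for text, key in items if key]
    false_lengths = [len(text.split()) for text, key in items if not key]
    return {
        "n_true": len(true_lengths),
        "n_false": len(false_lengths),
        "mean_words_true": mean(true_lengths) if true_lengths else 0.0,
        "mean_words_false": mean(false_lengths) if false_lengths else 0.0,
    }


# Sample statements borrowed from this chapter (an illustration, not a full test).
items = [
    ("In arresting a suspect, the police must apprise him of his legal rights.", True),
    ('Before the police can "frisk" a suspect for a weapon, they must obtain a search warrant.', False),
    ("The aneroid barometer has a liquid in it.", False),
    ("The crime of assault occurs when one person threatens to injure another.", True),
]

summary = summarize_true_false(items)
print(summary)
```

If the counts or the average lengths differ sharply between the two groups, revise the items before giving the test. How large a difference is too large is left to your judgment; the sketch reports the figures and nothing more.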



Questions 



DIRECTIONS: Each of the following true-false items could use some 
improvement. First, briefly state what is wrong with each, then rewrite 
the item correcting the error. 

13. T F Methyl alcohol contains a higher percentage of oxygen and 
therefore more oxidation is possible. 

14. T F Taxation was not a primary reason for the American 
Revolution. 

15. T F Punishment is not an effective method of changing behavior 
and is not usually preferred. 

16. T F Athlete's foot fungus feeds primarily on germs and other 
fungi. 

17. T F "If a town is near the ocean, then its average winter 
temperature will be higher than that of an inland town." 

18. T F A search warrant has to name the item to be seized and the 
crime committed, but does not name the person to be 
searched. 



The Corrected True-False Exercise 

As mentioned earlier, a major criticism of true-false items is based 
on the high probability of a correct answer by guessing alone. If all 
the students randomly respond to a 100-point true-false test, the 
average score will be around 50. The realistic range of scores on 
such a test spans from 50 to 100. This is not particularly efficient, 
when you consider that a 100-point completion test has a potential 
range from 0 to 100 (if the student can recall nothing, he gets 
none right). On a 100-point multiple-choice test with one correct 
and three incorrect answers for each question, the scores can range 
from about 25 to 100. This reduction in the potential range of the 
scores in a true-false test is not a particularly serious thing with a 
classroom test, since the usual purpose is to assess mastery of cer- 
tain objectives and not to discriminate among the students. How- 
ever, a fairly straightforward technique for "sharpening" the items 
exists. Rather than simply directing the students to mark the items 
"T" or "F," have them also correct the items which they mark as 
false. For example, the item: 

T F The aneroid barometer has a liquid in it. 

could be corrected to read: 

T F The aneroid barometer has a liquid in it. 
(with "aneroid" struck out and "mercury" written above it) 

T F The aneroid barometer has a liquid in it. 
(with "has" struck out and "does not have" written above it) 

You can score the item in two ways. You could score one point for 
each correct "T" answer, one point for each correct "F" answer, 
and an additional point if the "F" answer is properly corrected. 
Thus, two points could be gained with each "F" item and one for 
each "T" item. Or you could score a point for each correctly 
marked "T" item and allow a point for the correctly marked "F" 
item only if the item is properly corrected. The first scoring tech- 
nique seems most logical, since the two acts of identifying an in- 
correct item and then correcting it are somewhat distinct. 
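The arithmetic in this section can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed scoring routine: the function names `chance_score` and `score_corrected_tf`, the way responses are recorded, and the four-item sample are all hypothetical.

```python
def chance_score(n_items, n_options):
    """Expected number right when every answer is a blind guess."""
    return n_items / n_options


def score_corrected_tf(responses, key, scheme=1):
    """Score a corrected true-false exercise.

    responses: list of (mark, corrected_ok) pairs; mark is "T" or "F", and
        corrected_ok is True when the student's rewrite of an item marked
        "F" is judged acceptable by the teacher.
    key: list of the correct marks, "T" or "F".
    scheme 1: one point per correct mark, plus one more for a proper correction.
    scheme 2: a correct "F" earns its point only if it is properly corrected.
    """
    total = 0
    for (mark, corrected_ok), answer in zip(responses, key):
        if mark != answer:
            continue  # wrong mark earns nothing under either scheme
        if answer == "T":
            total += 1
        elif scheme == 1:
            total += 1 + (1 if corrected_ok else 0)
        else:
            total += 1 if corrected_ok else 0
    return total


print(chance_score(100, 2))  # 100-point true-false test
print(chance_score(100, 4))  # 100-point multiple-choice test, four options

# Hypothetical four-item exercise for one student.
key = ["T", "F", "F", "T"]
responses = [("T", False), ("F", True), ("F", False), ("F", False)]
print(score_corrected_tf(responses, key, scheme=1))
print(score_corrected_tf(responses, key, scheme=2))
```

Note that the first scheme gives partial credit to the student who spots a false statement but cannot repair it, while the second gives none. That difference is the practical meaning of treating identification and correction as distinct acts.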



Questions 

19. Write five true-false items. Make one each based on the first 
five chapters of this book. Be prepared to hand these in to your 
instructor. 

20. Get a fifth- or sixth-grade social studies text from the library or a 
local school district. Write five true-false items with one to five 
behavioral objectives linked to them. 



True-False Items and the Cognitive Taxonomy 

Some people have the mistaken notion that true-false items apply 
only at the knowledge level of the taxonomy of educational objec- 
tives. To illustrate that this conception is improper except for the 
one case, the following example has been included. All of the items 
are based on the novel The Catcher in the Rye by J. D. Salinger, 
usually read at the high school level. 

KNOWLEDGE T (F) Mr. Antolini was a teacher whom Holden knew 
from Pencey Prep. 

COMPREHENSION (T) F Holden gives Phoebe his red hunting cap 
so that she will be protected as he is. 

In order for the student to answer this, he must be able to interpret 
what purposes the hunting cap serves for Holden, so that he can 
translate Holden's reason for giving it to Phoebe. 

APPLICATION (T) F One of the rules of the game of life for 
Holden is don't hurt people. 

This question requires that the student understand what Mr. 
Thurmer meant when he told Holden that life is really a game. He 
must then apply this concept to the particular character of Holden 
and how he approaches the game. 

ANALYSIS T (F) Holden never completes his phone calls because 
he can't decide whom to call. 

Here the task is organizing the thought process in order to under- 
stand and analyze the relationships of the parts. The student must 
first find all of the times that Holden thinks of calling someone and 
doesn't. How is Holden feeling at these times? Then he must per- 
ceive a pattern being established. This should help lead him to the 
final conclusion that Holden's uncompleted phone calls are a 
symptom of his insecurity and fear of rejection. 

SYNTHESIS This is the exception! Since synthesis refers to the 
ability to put parts together to form a new whole, the product of 
such a process is difficult to phrase in a true-false format. Below is 
an example of a synthesis level task. 

Keeping in mind Holden's vocabulary, speech patterns, sentence 
and paragraph structure, write a composition in the style of J. D. 
Salinger in which you are Holden Caulfield at the zoo. Include 
Holden's thoughts, perceptions, and reactions to the things and 
people around him. REMEMBER: Holden has a passion for 
exaggeration, physical detail, and is self-analytic. 

The student must bring together many different kinds of cognitive 
information and skills in order to produce the theme. To write a 
paragraph as an example and then ask true-false questions about it 
would not be testing the student's ability to synthesize — only his 
ability to recall and recognize. 

EVALUATION T (F) J. D. Salinger's purpose was to show us 
that it's not easy to be a catcher in the rye. 

The student must understand Holden's dream of being a catcher in 
the rye and its function as an expression of his desire to save and 
protect people from the ugly experiences of life. He would then be 
able to see the statement is false. Holden's final decision is that 
it's impossible to try to save everybody, and that you have to just 
let people grow up and fall down and learn from their experiences. 



Question 

21. Using a book or article from one of your major areas of interest 
(your college major or the grade level where you think you might 
some day teach) construct one or two true-false items centered 
around a single theme at each of the six levels of the cognitive 
taxonomy. 



Multiple-Choice Items 

The multiple-choice item involves the stem, in which a question is 
asked or a statement made, and a series of three to five responses. 
One of the responses completes the statement or answers the ques- 
tion better than the others. The responses which are incorrect (or, 
more properly, less correct) are called distractors. Nearly everyone 
is familiar with items of this type: 

Which best describes the volume of one gram of water? 

X a. More than a drop but less than an ounce. 

b. More than an ounce but less than a cup. 

c. More than a cup but less than a quart. 

d. More than a quart but less than a gallon. 



Note that the item does not ask for precise equivalency relation- 
ships, demanding memory of the equivalency of the metric gram 
and the English ounce (ounce being a measure of volume and 
weight in the English system). The objective is to determine if the 
student understands that a gram of water has a very small volume. 
The item provides a good springboard for a series of suggestions 
which are designed to improve your construction of multiple-choice 
items in the classroom. 

1. The item should have one, and only one, answer which is 
clearly correct or clearly more correct than the others. In some 
cases, the item has a correct answer, in which case the direction, 
"Choose the correct answer," can be used. At other times, a ques- 
tion will not have an absolutely correct answer so the direction, 
"Choose the best answer," is more appropriate. 

Sometimes, to save time, you might be tempted to use the direction, 
"Which two of the following are most correct?" Avoid these 
kinds of items unless they are absolutely necessary. Students have 
experience with multiple-choice items and most of this experience 
is with the "one choice per item" variety. Selecting one choice per 
item is a habit. If you change the rules in a teacher-made test, 
you're likely to catch a few unwary students who will choose only 
one answer even though they could select the other if they had not 
been tricked by the format. Such trickery, as you recall, goes 
against a cardinal rule of testing. 

2. The distractors you choose are important. If the distractors 
are ridiculous and obviously incorrect, the student will focus on the 
correct response by a process of elimination and not through 
knowledge of the correct response. Here are some hints for con- 
structing satisfactory distractors: 

a. Have the distractors similar in form to the correct answer. If 
the correct answer describes an event, so should the distractors. 
If the correct answer is a prepositional phrase, make the distractors 
prepositional phrases. 

b. The distractors can be opposite in meaning to the correct 
answer, somewhat less precise, or somewhat less complete — all of 
these are satisfactory. 

c. The use of distractors which are true statements but which 
are clearly not the best answer to the question is encouraged. For 
example: 

DIRECTIONS: Choose the correct response. 

If the girl who is our star tennis player does not come, we will 
forfeit the match. 

a. Commas are used to set off an appositive. 

X b. Commas are used after an introductory adverbial clause. 
c. Commas are used to separate nonrestrictive clauses. 
d. Commas are used to separate items in a series. 

d. Have the distractors approximately equal in length to the cor- 
rect answer. 

e. Avoid the use of absolute words like "always" or "never" since 
these are rarely the correct answer. 

f. Avoid the use of "All of the above" as a distractor. Suppose 
three answers are given and the fourth is "All of the above." The 
student knows he will be marking only one answer. If he knows 
that two of the three distractors answer the question, he can 
reason that the proper answer is "All of the above" without knowing 
anything about the third distractor. This distractor is also 
poor when no single answer to the stem is absolutely correct, but 
even the worst response has some minor semblance of truth. "All 
of the above" is a technically correct response in such situations. 

g. Use "None of the above" as a distractor very sparingly. This 
response is effective only in the situation of writing multiple-choice 
items for objectives which are basically quantitative. The "None 
of the above" option keeps students from starting with each of the 
four answers given and working backwards. It also tends to make 
the multiple-choice numerical problems more "open-ended" as it 
changes the number of possibilities from three or four to an in- 
finite number. 

But the topic here is tests written by you, the teacher, for use in 
your own classroom. Test publishers artificially force numerical 
questions into a multiple-choice format because of the scoring 
problems which arise in allowing the student to work the problem 
in a supply-type format. With only thirty or so students, you don't 
have to worry about machine scoring. Use a supply format for 
problem-solving tests in math and science — or wherever a numer- 
ical problem comes up. If numerical problems are never phrased 
in a multiple-choice format, then the major use for "None of the 
above" is eliminated. 

One other point, though: If the directions are "Choose the cor- 
rect answer for each item," and all of the other distractors are 
somewhat correct, then "None of the above" will always be the 
correct answer. In some areas of the instructional program, precise 
answers are difficult to find. Rarely will a single answer be ab- 
solutely correct — making "None of the above" a bad distractor. 

h. In teacher-made tests, avoid using distractors which can be 
eliminated through the student's learning of some other lesson. 
Presumably, you will construct the test to measure the attain- 
ment of objectives in a current learning task. If the distractors 
focus on previous learning, you will be to a certain extent retesting 
material previously evaluated. That's something like double jeop- 
ardy to those students who did poorly on the previous task. 

3. The stem or statement should be in the form of a question 
whenever possible. The goal is to send the student into the four or 
five possible answers with a definite question in mind. Situations 
will certainly arise where a question is clumsy, and some sort of 
sentence completion stem is better. If you have tried to write the 
stem as a question and find the verbiage clumsy, then change; but 
try to write the question first. 



Some other suggestions regarding the stem: 

a. If a common thought, adjective, phrase, or clause is used in 
all possible answers, put the common part in the stem instead. 

b. The stem should be able to stand alone. Avoid stems like 
"Eisenhower believed that . . ." or "Music helps students . . ." or 
"Good citizenship requires that people. . . ." They don't stand 
alone and they don't ask a question. If you think a certain prin- 
ciple of Eisenhower's philosophy was important, ask "Which of 
these best describes the Eisenhower philosophy?" Probably you 
would ask such a question only if you could identify a single de- 
fensible "best" answer. The difference in the two questions about 
President Eisenhower should be clear. The first requires that the 
student read four or five statements without any real question in 
mind. The second sends him into the responses with a specific 
question (and answer, if he has fulfilled the objective) in mind. 

c. Wherever possible, avoid negative statements in the stem. If 
you must ask a negative to fulfill an objective, restate it in some 
other way and use a true-false item. 

4. The correct answers should form a random pattern. Some 
teachers, inadvertently, place the correct answer in the third or 
fourth position with great frequency. The alert student will be- 
come aware of this. Try to have each letter or distractor position 
used approximately the same number of times. 

5. Avoid having a dependency between or among items. That 
is, don't make the ability to answer one item a function of the 
accuracy of the answer to the item directly before it. If the stu- 
dent is inaccurate in the first response, he would also get the 
second wrong. Make the items independent of each other. 
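Suggestion 4 can be handled mechanically rather than by eye. The sketch below is one possible approach, assuming you decide the position of each correct answer before typing the test; the function names `balanced_key_positions` and `place_answer` are hypothetical, and the distractors in the sample item are invented for illustration.

```python
import random


def balanced_key_positions(n_items, n_options=4, seed=None):
    """Return one correct-answer position (0-based) per item.

    Every position is used about equally often, and the order is shuffled
    so that the key follows no obvious pattern.
    """
    rng = random.Random(seed)
    positions = [i % n_options for i in range(n_items)]
    rng.shuffle(positions)
    return positions


def place_answer(correct, distractors, position):
    """Build the list of options with the correct answer at the given position."""
    options = list(distractors)
    options.insert(position, correct)
    return options


positions = balanced_key_positions(20, n_options=4, seed=1)
counts = [positions.count(p) for p in range(4)]
print(counts)  # how many times each of the four positions holds the answer

# Hypothetical item: the distractors are made up for this illustration.
options = place_answer("warrant", ["subpoena", "summons", "writ"], positions[0])
print(options)
```

Dealing the positions out evenly before shuffling guarantees the balance that a purely random draw only approximates. A shuffle can still produce short runs of the same letter, so glance over the finished key.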



Multiple-Choice Items in Teacher-Made Tests 

A glance at the major published tests of achievement and aptitude 
will illustrate that the multiple-choice item is the basic building 
block of this industry. Surely nine of ten items in such tests will 
be in a multiple-choice format. The reasons for the high multiple- 
choice emphasis include ease of scoring, higher discrimination 
power per item, and the ability to measure all cognitive levels with 
a single item-type. The scoring issue is probably the critical one. 
What's good for the testing industry may not be good for the 
classroom teacher. Although individual variations exist among test 
publishers, the process whereby they construct their items usually 
goes something like this: 

1. After first generally defining the domain of content to be 
covered, professional item writers generate at least twice as many 
items as will eventually be needed in the test. A "pro" does not 
turn out vast quantities of items. As few as one or two useable 
items per hour would be constructed by one writer. 

2. Next, the items are put through at least one trial. The trial 
consists of administering the item to a sample from the group 
toward whom the test is directed. The sample will consist of hun- 
dreds of such students. 

3. Based on the trial, certain of the items written by profes- 
sionals will be discarded as unsatisfactory and most of the re- 
mainder will be revised somewhat. A few of these will go through 
a second trial. 

4. Finally, those items which meet certain predetermined spec- 
ifications will be included in the published test. 
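The text does not say what the publishers' "predetermined specifications" are. As a hedged illustration only, the sketch below computes two statistics commonly taken from a trial: the proportion of students answering an item correctly, and the difference between the top-scoring and bottom-scoring groups on that item. The function name `item_statistics`, the 27 percent grouping (a common convention, assumed here), and the six-student trial data are all hypothetical.

```python
def item_statistics(responses, item, fraction=0.27):
    """Return (difficulty, discrimination) for one item from trial data.

    responses: list of per-student lists of 0/1 item scores.
    difficulty: proportion of all students answering the item correctly.
    discrimination: proportion correct in the top-scoring group minus the
        proportion correct in the bottom-scoring group.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    group_size = max(1, round(len(ranked) * fraction))
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    difficulty = sum(r[item] for r in ranked) / len(ranked)
    discrimination = (
        sum(r[item] for r in upper) - sum(r[item] for r in lower)
    ) / group_size
    return difficulty, discrimination


# Hypothetical trial: six students, three items, 1 = correct.
trial = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

difficulty, discrimination = item_statistics(trial, item=1)
print(difficulty, discrimination)
```

A real trial uses hundreds of students, as step 2 notes. With a single classroom the figures are rough at best, which is one more reason the publisher's procedure does not transfer directly to the teacher-made test.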

Now, it is possible for the classroom teacher to write multiple- 
choice items. Tryouts are also possible if essentially the same in- 
structional program is used year after year. However, the motiva- 
tion to write multiple-choice items is clearly not as strong with 
the individual classroom teacher as it is with the test publisher. 
And the classroom teacher also has a sharply reduced supporting 
staff. 

The goal of the above preamble is not to discourage you from 
constructing multiple-choice items. The purpose is to assure you 
that a "good" classroom test doesn't necessarily have to look like 
a published test with the major emphasis on multiple-choice items. 
Choose the item format (performance, completion, essay, match- 
ing, true-false, or multiple-choice) to conform to your objectives. 
Choose the format with which you are comfortable. Let the ob- 
jectives, coupled with your own personal preference, govern the 
format of the test. Don't let the "tail wag the dog." 



Questions 

22. For each of the following, first outline what you think is wrong 
with the item, then rewrite the item to eliminate the error. 

Before a policeman can search a house, he must obtain a 

a. warrant. 

b. good laugh. 

c. treaty. 

d. All of the above. 



The probability of pulling a white ball from a bag containing 3 
white balls and 6 black balls is 

a. the product of 3 and 6 

b. 3/10 

c. 7/2 

d. 4/5 

The atmosphere contains more carbon dioxide now than it did 50 
years ago. What is causing this increase? 
a. The increased phosphate pollution. 

b. The use of pesticides and herbicides which is necessary 
for farming. 

c. Widespread use of fossil fuels. 

d. Increased numbers of bacteria. 



A defendant's right to counsel means that 

a. he is awarded $1000 to hire a lawyer. 

b. a psychiatrist can be consulted. 

c. he cannot be his own lawyer. 

d. None of the above. 

According to the above definition, "judicial review" does not mean 

a. 2 weeks later the decision is checked. 

b. the Supreme Court decides if an act of Congress is con- 
stitutional. 

c. lower courts can appeal. 

d. a legal journal printed semi-annually. 

Which of the following characteristics define a desert? 

a. Lots of sand. 

b. Measuring the amount of rainfall in a year. 

c. Last course in a meal. 

d. Number of camels per square mile. 

23. Find a textbook or article in a social science area. Write a be- 
havioral objective and an accompanying multiple-choice item for 
at least five of the six levels of the cognitive taxonomy based on a 
chapter in the text or article. 



Summary 

This chapter has covered the majority of measures that are used 
in classroom evaluation. They range from the typical 
paper-and-pencil supply-type items through performance and unobtrusive 
measures. If these last are in less common usage, it is no doubt 
because the necessary extra time and materials make them less 
appealing than a mass-administered paper-and-pencil test. Teach- 
ers should be aware, however, that with a task analysis, selection 
of key skills, careful planning of the particulars of the testing situ- 
ation, and an objective scoring system, a performance measure can 
be constructed to reach effectively and efficiently those skills which 
paper-and-pencil can't reach. 

For objective items in general, there are some considerations 
which should be kept in mind. Their purpose, of course, is essen- 
tially to make your task, as well as that of your students, easier. 
These include using objective items to get at higher level skills, 
deciding number, type, and answers for items beforehand, and 
keeping item-types and content areas grouped. Arrange items 
from easy to more difficult and don't change columns in the middle 
of an item or response. Dealing with the specific kinds of options 
offered by completion items, matching, computations, or category 
matching, remember to make your directions clear and use alpha- 
betizing as a means of ordering choices. 

The popular essay question serves its purpose when, like other 
formats, it is geared to your objectives. Its greatest asset is the 
potential for higher level questioning. Instead of one general ques- 
tion, use a series of shorter essays. Have a correct answer written 
out with point allocation in advance. Avoid making the student 
choose one question from a number of options, as that only makes 
his task more difficult by adding one more decision. 

True-false and multiple-choice items are also popular. In order 
for them to be "good" measures, caution must be used in con- 
structing them. Both types should deal with a single concept per 
item, avoid negatives, keep the length of statements about equal, 
and have false statements or distractors sound as plausible as true 
statements or correct answers. Try to have an equal number of 
true and false statements in an exercise, avoiding sentences copied 
from the text. The scoring problems related especially to the 
corrected true-false item should be considered beforehand. 

There are some particular warnings unique to multiple-choice 
items. Avoid absolutes such as "always," "never," and "all of the 
above." Use "none of the above" as an option rarely. Any dis- 
tractors that are included must relate to the present task and 
should be similar in form. The stem should be a question that can 
stand alone and not a modified form of completion item. Avoid 
dependency between items so that a student who lacks one bit of 
information is not penalized in succeeding questions. 

The most important thing that any test writer should remember 
is his final objective. This will guide any selection of item type 
and format, as well as direct the content of questions. Whether it 
be a performance measure or essay exam, choose the format which 
you find comfortable and that best suits you and the interests of 
your students. 



Classroom Applications 



Enough "talking about" classroom testing. The time has come to 
get specific. In this chapter, two applications of the exhortations 
introduced in the first five chapters will be presented. As you will 
see, the two examples are placed in widely disparate classroom 
environments. After the examples, a "little" assignment will be 
outlined for you. The assignment is named a "mini-internship." 
As you might expect, your task will be to identify a setting where 
some learning task is under way, and do an evaluation at that 
site. You will have to identify the instructional program, write the 
objectives and the evaluation items, clear all of these with the 
teacher in charge of the learning task, and administer, score, and 
report the results of your evaluation. This will all be explained at 
the end of the chapter. 

First, though, the two applications. 

APPLICATION I 

The setting is a tenth-grade English class studying a unit on 
poetry. The school is in a stable, integrated community which is 
approximately twenty percent nonwhite. The community takes 
its high school seriously. A substantial majority of graduating 
seniors move directly into college. In no way, however, can the 
student body be viewed as homogeneous. The socioeconomic 
levels of the students' parents vary widely, as do the perform- 
ance levels of the students in any given classroom — especially a re- 
quired subject like tenth-grade English! 

Specificity and Acceptance of Objectives: The task of specifying 
objectives for tenth-grade English had been undertaken during the 
previous summer by a committee of English teachers in the high 
school. A few important objectives were stated in performance 
terms. These were considered to be imperative by all the teachers 
in the writing group. In addition, a longer list of suggested terms 
and concepts had been enumerated. Not all were translated to 
performance language with specific goals superimposed. Disagree- 
ment existed over the importance of the objectives on the second 
list. Thus, the list was compiled, and the teachers were allowed to 
select objectives from it according to their own personal prefer- 
ences. Two examples from the first list (considered by all to be 
very important objectives) were: 

1. The student should be able to recognize all of the metaphors 
in a given paragraph or poem. 

2. Given a paragraph or poem, the student should be able to 
understand the comparisons in each metaphor, and demonstrate 
that understanding by writing a statement like "______ is 
being compared to ______" for each of the metaphors. 

The second list (where disagreement existed over the importance 
of the objectives) included these statements: 

1. Given a list of brief poems, the student should be able to 
distinguish those illustrating patterned metrical feet from 
those using free verse. 

2. Given examples of poems in patterned metrical feet, the stu- 
dent should be able to name the meter (e.g. iambic, trochaic, 
dactylic, etc.). 

A certain amount of disagreement existed among the teachers 
regarding the extensiveness of the coverage of some of the concepts. 
A few felt that recognizing examples of the important terms was 
satisfactory. Others urged that the student be asked to construct 
a written explication of the poem. The explication would be ex- 
pected to include various poetic techniques used to support the 
student's interpretation of the poem. Such demands, of course, 
involve more than mere recognition of examples of terms. The 
student would need writing skills of organization and paragraph 
construction, as well as analytic thought processes to fulfill the 
objective. 

Comprehensiveness of the Measure: A classroom teacher approaching 
the poetry unit has included all of the objectives listed 
in the first "imperative" list, plus selected statements from the 
second. Those included in the second list reflected the teacher's 
own personal biases. The list which was assembled was not par- 
ticularly extensive so that one or more evaluation devices could be 
assembled for each objective on the list. Thus, under the COM- 
PREHENSIVE heading in Table 4 of chapter 4, response number 
1, indicating evaluation of every objective in the list, is appropriate. 

Choice of Item Types: The teacher has four sections of this particular 
class with an enrollment of about 120 students. A little 
extra effort in the test construction process will make the scoring 
task much easier. Wherever possible, objective items requiring 
recognition skills are placed in a multiple-choice format. The in- 
terpretation items are broken into a series of brief, short essay 
items with prewritten scoring keys. One matching exercise is used, 
linking poetic types with examples of each. True-false and the 
short essay items are saved for the last part of the test. 

Test Administration Decisions: Remember that some of the objectives
being evaluated reflect the teacher's own personal biases.
To a certain extent these selections are based on the teacher's likes
and dislikes, but they are also based on the perceived abilities of
the students. It is already known that a wide range of individual
differences exists in this classroom. If the teacher is truly sensitive
to these differences, a variable administration schedule should be 
implemented. Little can be gained from asking John to write a 
composition giving his interpretation of a particular poem if he 
has demonstrated that he is unable to even recognize a hyperbole — 
let alone comment on how it contributes to the overall theme of the 
poem. For students like John, the items concentrating on knowl- 
edge and recognition of terminology will be used exclusively. On 
the other hand, some students have grasped the terminology 
quickly, and can be expected to relate the particulars of a poetic 
technique to the author's overall purpose. Some of the knowledge 
and recognition items will be dropped for such students and re- 
placed with items emphasizing interpretation questions. 

Of course, the practical problem of preparing 120 individualized 
tests is apparent. The teacher decides to compromise, and con- 
structs three tests. The first covers a limited number of objectives, 
the second a few more, and the final test all of the objectives on the 
list. One mistake to avoid in such planning is that of always giving 
the same students the same level of test. Such behavior in essence 
"pigeonholes" a student into a particular level. If you test in this 
manner using hierarchical test levels, be sensitive to all your students.
During a two- or three-week unit, the task of fairly accu-
rately identifying the approximate performance level of 120 stu- 
dents is not completely overwhelming. To be sure, you'll make a 
few mistakes — administering a too-difficult test to some, and a 
too-easy test to others. Such mistakes will be neither extensive nor 
personally devastating. 

The teacher wants to move the class along together, and does 
not want one group interpreting poetry while the next group under- 
takes a research assignment. Thus, a second administrative deci- 
sion is that all shall be tested at the same time. 

Level of Expectations of the Students: The teacher reasons this 
way: "I have taken the time to write tests which, to the best of my 
ability, reflect the general performance level of each individual. I 
am going to expect each student to accurately complete 80 to 85 
percent of the test which is administered to him." 

Reporting the Results: The teacher feels that knowledge and 
interpretation of poetry is a personal sort of thing and that relative- 
ranking scores are inappropriate. Knowing how John's score com- 
pares to Pete's is unimportant. Since the test is criterion-refer- 
enced, with one or more items attached to each objective of the 
learning task, a checklist kind of reporting seems very reasonable. 
The teacher provides a checklist of objectives for each student, 
indicating which the student had attempted and, of those at- 
tempted, which had been completed in a satisfactory manner. If 
you ever report scores to students in this manner, it might be well 
to couple the list distribution with an announcement that those who 
had not attempted an objective, and who wanted to do so, could 
make arrangements with you. 
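
A checklist report of this kind is simple enough to sketch in a few lines of Python. The example is only an illustration under stated assumptions: the objective labels, the student record, and the function name checklist_report are hypothetical and are not taken from the actual poetry unit. For each objective it shows whether the student attempted it and, if so, whether the work was satisfactory.

```python
# Hypothetical record for one student. Each objective maps to:
#   True  -> attempted and completed satisfactorily
#   False -> attempted, not yet satisfactory
#   None  -> not attempted (the student may arrange to try it later)
def checklist_report(results):
    """Return one report line per objective, in the order given."""
    lines = []
    for objective, outcome in results.items():
        if outcome is None:
            status = "not attempted"
        elif outcome:
            status = "attempted, satisfactory"
        else:
            status = "attempted, not yet satisfactory"
        lines.append(f"{objective}: {status}")
    return lines


# Made-up data for illustration only.
pete = {
    "Recognize a hyperbole": True,
    "Identify the comparison in a metaphor": False,
    "Write an explication of a poem": None,
}

for line in checklist_report(pete):
    print(line)
```

Notice that nothing in the report compares one student with another; each line refers only to an objective and to what this student did with it.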

Summary: Can you see how the evaluation process is integrally 
entwined in the instructional process? Specifying objectives clari- 
fied to both you and the students what the purposes of your in- 
struction were. Both testing and reporting were easier once the 
objectives were enunciated. Rather than ignore obvious individual 
differences, the teacher overtly took them into account in both 
testing and reporting. 

Some very personal philosophical viewpoints enter into testing 
decisions. Must reporting always be competitive, showing how each 
student compares to the others or to some norm group? The deci- 
sion is up to you — you have a choice. Must all students be evalu- 
ated with precisely the same items? Is it fair to ask for more 
performance from one student than from another? The decision is 
up to you, and your choice reflects the way you view the world. 



What, after all, is the primary mode whereby a student is evalu- 
ated? Is it through the use of standardized test batteries? No! The 
bulk of the time spent evaluating a student is in a classroom by a 
teacher using a teacher-made test. Your decisions about testing 
and reporting are very important in the way that student perceives 
himself. Don't drift into your evaluation. Plan in a very systematic 
manner. 

APPLICATION II 

The setting is a self-contained fifth-grade classroom. The topic is 
science, and in particular, a chapter in the science text dealing with 
electrostatics. The school is centered in a suburban district and 
primarily serves children whose parents are apartment dwellers. 
The population is somewhat transient, although about one-third of 
the families are permanent residents in one-family dwellings. 

The unit has three general purposes. First, it will introduce the 
idea of two kinds of charges on materials, along with the concept 
that these types of charges interact with each other in special, 
predictable manners. Second, the student will do a considerable 
amount of laboratory work with the two kinds of charges, using
rubber and glass rods rubbed with wool and silk, respectively. The
charges from these rods are then transferred to balloons and an
electroscope. All of this laboratory work is designed to develop the
laws of attraction and repulsion between charges. Finally, the unit
attempts to make future learning about electrostatics easier: the
laboratory learning is to be translated into drawings showing
positive and negative charges on materials. After all, illustrating a
concept with a drawing is easier than having the child actually 
perform the experiment — once the experiment has been used to 
develop the basic notions. 

Specificity and Acceptance of Objectives: Most of the objectives 
of an elementary school science program are amenable to transla- 
tion into behavioral language. To add frosting to the cake, usually 
very little serious disagreement is found among science teachers 
regarding the importance of a concept like electrostatics. In this 
day and age, a good elementary school science series should have 
the objectives for each chapter listed in behavioral language. To be 
sure, some of the broader (and more important) objectives like 
"appreciate the scientific method" or "develop critical thinking 
skills" are a little tricky to state behaviorally, but the bulk of 
objectives for each chapter could be written. Even if the textbook 
series has fallen down at this point, a committee of teachers in the
district could remedy the situation by four to six weeks of effort in 
the summer. The assumption that all objectives are stated in be- 
havioral language, and that the objectives are generally acceptable 
to all is a reasonable one. 

Comprehensiveness of the Measure: The list of objectives is not 
particularly long. The preparation of one or more evaluation items 
for each objective is a reasonable goal. 

Item Types: To evaluate the laboratory work, performance mea- 
sures are necessary. In the laboratory, the student was to have 
learned to place a positive or negative charge on a rod; to transfer 
a specified charge to a balloon or an electroscope; and to place the 
charged rod in an insulated swing for use in carrying out experi- 
ments. In addition, when another student charged a rod or a 
balloon with a charge unknown to a second student, the second 
student was expected to carry out certain tests to determine which 
charge, if any, was on the balloon. Each time the student carried 
out an experiment, he was expected to make a drawing of the 
equipment, showing where charge was accumulating and showing 
the polarity of the charge with "+" or "-" markings.

Multiple-choice tests are a puny substitute for sparks flying be- 
tween balloons. If the goal was to learn in the laboratory, why not 
test them right there? A performance test is clearly indicated for at 
least some of the objectives. 

The teacher prepares a structured series of tasks to reflect the 
objectives. At one station the student would be asked to place a 
plus or minus charge on a rod; at a second, to illustrate with two 
balloons the principles of attraction (unlike charges, or any charged
object and a neutral one) and the principle of repulsion. Another station
might have an "unknown" charge on a balloon — a plus, neutral, or 
minus charge. The student would have to conduct an experiment to 
find the unknown. Five or six stations might be enough for the 
performance part. A checklist should be at each station, naming 
the objective and listing the students' names. As each student 
passes the station, the operator checks "mastered" or "not mas- 
tered" by the objective. 

Too much trouble? You can't be at all stations at the same 
time? Where can you get some help? You could use one or two of 
your students who have mastered the objectives already as your 
helpers. Avoid using the same students all the time, though. If
other fifth graders are in the same building, borrow a few students
from that teacher; manning a station would be an excellent learning
experience for them. Older students are generally more than
willing to help out. When a written test is clearly inappropriate and
the objectives indicate the necessity of a performance measure, find 
a way to evaluate in that manner. 

The second set of objectives centers on the drawings (showing
+'s and -'s) which the student made while observing the experiments.
He should also be able to interpret drawings that you have
made to illustrate electrostatic principles. The teacher carries
out some demonstrations at the front of the class and has them 
make drawings to take care of those objectives. To measure the 
ability to interpret electrostatic drawings, multiple-choice or true- 
false items are appropriate. 

Test Administration Decisions: The teacher decides that in the 
area of science the information is basic. The concepts are not ex- 
tremely difficult. Although the students have a range of abilities, all 
are within a "normal" range, and can be expected to fulfill all of the 
objectives. To individualize instruction so the fastest learner is not 
bored while the slower ones master the concepts, the teacher has 
the faster ones do other projects during some science periods, help 
out as tutors for classmates, prepare the evaluation stations, and 
administer some of the evaluations. The goal, then, is that all stu- 
dents master all objectives. The testing will be done for all the 
students at the same time. 

Level of Expectation for the Students: As mentioned above, all 
students are expected to fulfill all objectives. 

Reporting of Results: All students will at least attempt all of the 
objectives. A comparative reporting system seems needless. The 
teacher provides a checklist of objectives to all students, indicating 
which have been satisfactorily mastered. If the parents of the students
demand a grade (usually it's the parents, not the students,
who want a grade), tell them that any student mastering 85 percent
or more of the objectives receives an A and all the rest receive an F.
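
The grading rule just described can be written out as a short Python sketch. It is an illustration only: the objective labels loosely paraphrase the laboratory tasks in this application, and the function names fraction_mastered and letter_grade are hypothetical. The 85 percent cutoff comes from the rule above.

```python
def fraction_mastered(checklist):
    """Share of objectives marked mastered (True) on one student's checklist."""
    return sum(checklist.values()) / len(checklist)


def letter_grade(checklist, cutoff=0.85):
    """All-or-nothing rule: an A at or above the cutoff, otherwise an F."""
    return "A" if fraction_mastered(checklist) >= cutoff else "F"


# Made-up checklist for one student: six of seven objectives mastered.
student = {
    "Place a chosen charge on a rod": True,
    "Transfer a charge to a balloon": True,
    "Transfer a charge to an electroscope": True,
    "Mount a charged rod in the insulated swing": True,
    "Identify an unknown charge on a balloon": True,
    "Draw the charges observed in an experiment": True,
    "Interpret a drawing of charges": False,
}

print(letter_grade(student))
```

With only two possible grades, the rule makes the teacher's commitment plain: the checklist, not the letter, carries the useful information.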

Summary: By the advance planning of objectives, the necessity 
for performance measures was made apparent. A performance 
measure is a very real and concrete thing. You can use it to stimu- 
late interest. Planning the entire evaluation program in advance 
helps focus your instruction on areas which you have decided are 
most important. The decision to bring all students to mastery on all 
objectives is risky and challenging. However, if the objectives are 
considered fundamental for future learning, it is a commitment you 
must make. A distribution of scores on the electrostatic test would 
be unsatisfactory. All A's is not just a desirable outcome. It is the 
only acceptable outcome if the objectives are that important. 



A Mini-Internship in Classroom Evaluation 

Evaluation devices prepared by the teacher for use in a particular 
classroom have been the central theme of the entire book up to this 
point. The information which follows in later chapters is necessary 
general background information if you are effectively to use and 
interpret tests written by others, and if you are to participate in 
evaluations not necessarily limited to a single classroom. Before 
moving on to these new topics, a task designed to see how well you 
can put into practice what has been presented up to here is appro- 
priate. Reading about evaluation is one thing; putting the informa- 
tion into practice is a bit different, as you will surely discover in 
carrying out the assignment which follows. 

The assignment can be carried out under a variety of conditions. 
Your instructor might assign it as a course paper. You might, at 
some later date, use the outline as a credit-producing independent 
study. You might undertake the task on your own on a noncredit 
basis, just to have the opportunity of putting your learning into 
practice — and finding out if you like this kind of work. The basic 
assignment has seven stages, which will be described in some detail. 
Some other options, for selection by you or your course instructor, 
are presented after the seven basic stages.

Fundamentally, the task involves identifying some educational 
setting where a test will be required at a time when you can 
administer it, writing the objectives and the test, administering and 
scoring the test, reporting the results, and describing the entire 
process in a paper. The seven specific steps are as follows. 

Stage 1: PERSONAL ORIENTATION. Stage 2 will describe
kinds of sites you might use for your testing project. But before 
going on to this phase, it would be well (if only for your own peace 
of mind) for you to decide the kind of evaluation you would like 
to carry out. Are your thoughts more directed toward elementary 
school or high school? More toward a science than a social science 
or language arts? Do your feelings seem to run toward criterion- 
referenced testing or normative testing? Would you like to con- 
struct performance tests? Devise unobtrusive measures? How much 
time are you willing to spend? If the content over which you will be 
evaluating is presented in a straight lecture format without an 
outside text, you will simply have to attend the class for all the 
lectures — otherwise, how can you write a test? 

The assignment might be a good time to break down certain of 
your biases. Have you ever felt that certain multiple-choice tests
you have taken were too memory-oriented — too "Mickey Mouse"? 
This is a golden opportunity for you to try your own hand at the 
task to see if you can concentrate a multiple-choice test on the 
higher cognitive levels. If you do end up teaching, you'll do a lot 
of testing in your career. Experiment a little now. 

Stage 2: SELECTING A SITE. Now it may be that finding a
location to administer a test will be so difficult that you cannot 
pamper your own personal wishes. When this assignment has been 
tried out with other students, however, the task of finding a testing 
location has not been a difficult one at all. By and large, instructors 
at all levels are more than happy to have someone construct an 
evaluation device for them. 

Only a minimal number of restrictions will be placed on the kind 
of site you can use for this assignment. You must be able to admin- 
ister the measure in time to allow you to report on it before the 
end of the reporting period. The evaluation device must be con- 
structed for an identifiable body of students. That is, a question- 
naire to be administered to a random sample of some larger 
population is not satisfactory. Generally speaking, a class of students
under the direction of a teacher or teachers will be most
satisfactory. The teacher or team of teachers must be willing to use 
your evaluation device (after inspecting it, of course) as a part of 
the course evaluation. Aside from those, the field is wide open. 

Where can you look? The answer to that question depends a 
good deal on your personal orientation and your personal contacts. 
If you're interested in nonreaders, consider nursery schools, kinder- 
gartens, or day-care centers. These evaluations, of course, would 
probably need to be given individually, since the children cannot 
read and have little experience with group testing. 

Performance tests? Consider science classes at all of the public 
school levels. Or math classes would do — especially if the topics 
involve some aspect of geometry. The vocationally oriented classes, 
like woodworking, home economics, typing, auto repair, driver 
education, etc., are excellent sites for performance tests. 

But if you're interested in trying your hand at paper-and-pencil 
tests all you need to keep in mind is that the students must be able 
to read. That leaves the field wide open from third grade through 
graduate school. 

Maybe you have a special interest — a religious class, private 
music students, a special interest group outside of the public 
school, educationally handicapped children, or students participat-
ing in some sort of physical activity. Maybe you'll even want to
evaluate the informal learning that goes on in an identifiable group 
of students. 

Worried about finding a site for your evaluation? Think first of 
the contacts you have already made before approaching strangers. 
Have you done any observing in classes up to this point? Can you 
recall a teacher in high school or college with whom you established 
good rapport? Have you ever discussed classroom evaluation with 
any of these teachers? Do you do any tutoring, special teaching, 
part-time work in an educational enterprise of some kind? Any of 
these should lead to a solid contact. One bit of information should 
be made clear at the outset: You are equipped at this point to 
write a good evaluation for some teacher — a hell of a good evalua- 
tion, if you put your mind to it. That teacher isn't doing any kind 
of one-directional favor for you by allowing you to write and 
administer an evaluation. Most teachers simply don't take the time 
they know they should with the evaluation process. All you need to 
do is convince the teacher that you are capable of writing a good 
evaluation and the task of obtaining cooperation should be minimal.¹

Don't just walk up to a teacher (even a teacher who is a personal 
friend) and say "Hey, I have to write a test. Can I use your class?" 
Nobody wants a rank amateur tampering with their instructional 
process. Instead, be prepared to talk about informing yourself of 
the instructional program, translating general objectives to behav- 
ioral language, writing criterion-referenced tests, test administra- 
tion schedules, variable reporting systems, and any particular 
innovations that you would like to try out. You would be wise to 
develop your proposed plan carefully before approaching the 
teacher. Who knows — if your sales pitch is good enough you might 
be able to compose the mid-term or final exam for a class in which 
you are currently enrolled! That way, you could complete this 
assignment and miss one final exam all at once. 

1. Two undergraduate students in a class taught by the author had no
personal contact to approach in order to fulfill this assignment. They decided
to try a local high school to find a teacher. They simply walked into the
teachers' lounge and began explaining the task to the first teacher they could
corner. Before ten minutes had gone by, they had at least six other teachers
willing to be the site for the testing. As soon as you can convince the teacher
that you are capable of providing a valuable service, your site-location
problem is solved.

Stage 3: BECOMING COMPLETELY FAMILIAR WITH
THE LEARNING TASK YOU WILL BE EVALUATING. The
assignment will not be fulfilled properly if the classroom teacher
tells you what to put in the test. You must take it upon yourself
to identify all of the elements of the learning task which you will be 
evaluating. Is the task primarily textbook-oriented? Are there 
films, lectures, experiments, field trips, discussions, outside reading, 
reports, or group meetings which are part of the task? You cannot 
possibly evaluate any learning task until you have become familiar 
with every facet of it. 

For practical reasons, then, a one- or two-week learning task 
would probably be a good choice. In most cases, you will want to 
attend the class while the learning task is being presented and a 
longer period of time could become burdensome. 

Stage 4: TO THE EXTENT POSSIBLE, TRANSLATE THE
LEARNING TASK INTO BEHAVIORAL OBJECTIVES. Don't
expect the teacher to handle this phase for you. In fact, don't work 
with the teacher until you have first drafted the list alone. Consult 
the teacher only after you have translated the task into behavioral 
language to the best of your ability. 

That seems backwards, doesn't it? You say the teacher knows 
what the objectives are so why not just ask? This is the interesting 
part of the assignment. You, an outsider, are coming into the class 
not to observe teaching methods, classroom control, or any of the 
routine things observers usually do. Instead, you are observing the 
learning task and trying to identify what the objectives seem to be. 
On what is the teacher concentrating? Which parts of the task 
consume the largest proportions of time? You shouldn't be evaluat- 
ing what the teacher says are the objectives of the learning task. 
You will be identifying the objectives which are the primary ones, 
based on your observations. 

In defining your objectives, consult once again the techniques 
provided in chapter 2, but also take chapter 4 into careful con- 
sideration. Be prepared to discuss with the teacher the concept of 
specificity and acceptability of the objectives. Do you think you 
can translate all of the important expected outcomes of the learning 
task into behavioral language? If not, do you think your list at 
least "catches the spirit" of the entire task — such that the student 
who accomplishes your list of objectives has also probably mastered 
the entire learning task? Do you want to evaluate every objective 
on your list, or will you be doing some sort of sampling process? 
Don't break up an individualized instruction program just for your 
own convenience — that's testing at its worst! If the students are 
completing the various objectives at different time intervals, make 
arrangements to do your evaluations as they complete the task. 



Will all of the students be expected to complete all of the objec- 
tives? Or will you vary the items presented as a function of the 
number of tasks the student can reasonably be expected to com- 
plete? Do you expect every student to master every objective? 

All of these matters must be considered as you prepare your list 
of objectives. Decide the answer to each question on your own, 
then discuss the decision with the teacher. If your reasons for wanting
to handle the task in a certain manner are rational, you should
have little problem convincing the teacher to cooperate with "your
way." Review the discussion in chapter 4 carefully as you answer
each of the questions. The decisions you make are important. Don't 
make them capriciously. 

The final act of this stage is to sit down with the teacher and go 
through your list of objectives. Discuss your various test adminis- 
tration decisions. Make arrangements for the time or times when 
you will administer the evaluation. 

Stage 5: PREPARING THE ACTUAL EVALUATION DEVICE.
"Evaluation device" has been used in lieu of "test" because
you might just be making unobtrusive observations, preparing a 
performance measure, or doing individual interviews with little 
children instead of writing the more familiar paper-and-pencil test. 

In this stage you have the task of choosing the item formats for 
your evaluations. You can choose from the pool of formats presented
in chapter 5: performance items, unobtrusive measures,
supply-type items, and nonsupply-type items. In addition, if
you want to work ahead, chapter 11 is devoted to the measurement 
of attitudes, and also concentrates to a certain extent on a struc- 
tured interview approach to evaluation problems. 

This assignment provides a learning experience for you, so don't 
concentrate your entire evaluation on a single item type. If the 
range of objectives will allow, try to use a variety of different for- 
mats. Run the entire range if you can. 

You might want to add a step at this point. Make an agreement 
with some other member of the class to exchange tests for critical 
comment. Sometimes a person gets so close to a task that obvious 
errors are overlooked. 

Stage 6: ADMINISTRATION, SCORING, AND FEEDBACK.
Make arrangements so that you can administer the test.
Some of your best learning will occur as you watch the students 
trying to puzzle their way through your "pearls." Administering 
the test is an important phase, so don't pass this task off to the 
teacher or a friend. As quickly as you possibly can, evaluate the
results of your work. Feed the results back to the students in 
the manner you had determined earlier. Will it be a checklist? 
Norm-referenced scores? Do you show them the range? Do you 
have specific suggestions for further work? Again, do this part of 
the task in person. Keep notes on the feedback you get from the 
students. Did they think your evaluation device was fair? Was it 
too hard? Too easy? Incomplete? You usually won't have to ask. 
Most students are happy to give their impressions of a test! 

Stage 7: FINAL REPORT. The final report should be almost
like a diary. The rationale for each of your decisions should be pre- 
sented — briefly. How did you make your contact? What are the 
phases of the total instructional task? List your objectives. Outline 
any points of disagreement that you had with the teacher with 
whom you worked. Show how you translated the objectives into 
various item formats, and defend your choice of formats. Show the 
actual test. Include a section of things you observed as you admin- 
istered the test and fed the results back to the students. Present a 
summary of your results in some sort of tabular manner. End the 
report with comments on how you might improve the process next 
time around. 

Possible Variations to the Assignment 

In the chapter which follows, a number of measurement topics will 
be presented. These include reliability, validity, standard scores, 
and standardized tests. Each of these could be added to the assign- 
ment. You might devise techniques for determining reliability or 
validity coefficients for your measure. The scores might be trans- 
lated into one of the well-known standard score scales before re- 
porting to the students. You could search for a standardized 
measure to administer in conjunction with your "home-made" test 
and have a comparison of the results from the two. 



How Will We Know

When We Get There? 

Standardized Measures 



What time is it right now? You check your watch — maybe look for 
a clock. Is your watch correct? Who set the clock? To set a clock a 
person usually listens to the radio. "At the tone, the time will be 
8:30." How does the radio announcer determine the exact time? 
Can't his clocks run fast or slow as well? 

Near Washington, D.C., radio station WWV operates twenty-four
hours a day at a variety of frequencies. Station WWV and
Station WWVH, which operates near Puunene, Hawaii, are both
operated by the National Bureau of Standards. By operating at 
extremely precise frequencies, these two stations keep other sta- 
tions calibrated to the proper frequencies. This helps avoid a lot 
of confusion. The two stations also give the precise time every 
minute. The time on these stations is the standard for the country. 
"This clock is ten minutes slow." Compared to what? The official 
standard of comparison is WWV. 

Laura takes a test. She answers 80 percent of the questions 
correctly. What does that mean? If the test is criterion-referenced, 
it means she has mastered 80 percent of the objectives. If the test 
is not criterion-referenced, but instead is a random or biased selec- 
tion of questions from some loosely defined domain of content, then 
the interpretation is like a strange clock. You can read the face of 
the clock just as you can determine the number correct on the test, 
but you don't quite know what either means. The test needs a 
WWV. The nearest things available are test standardization programs.
A test which has gone through a standardization program
is called (are you ready for this?) a standardized test. 

What is a standardization program? What tests are standard- 
ized? What are the differences between standardized and unstan- 
dardized tests? With a purpose in mind, how can you find out if a 
standardized test exists to fit your needs? How can you choose the 
proper item format in a standardized test — a format which fits 
the needs of your students? Why is it that some people are so 
hostile toward standardized tests? Answers to these commonly 
asked questions are the foundation of this chapter. 

What is a standardization program? Laura, you recall, answered 
80 percent of the items on a test accurately. One way to interpret 
Laura's performance is to compare it to the performance of other 
people like Laura on the same test. Then Laura can say, "Hey, 
that's great because the average person like me scores only 50 percent
correct!" Or, "Man, am I slow! Most people like me get 95
percent correct." A test standardization program focuses on the
"like me" clause. That is, the publisher first identifies the population
for whom the test has been written. A carefully chosen sample
of scores from this target population is accumulated. Laura and
others then compare their scores with those in the typical sample.
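
The comparison Laura makes can be sketched in Python. This is a rough illustration, not a publisher's actual norming procedure: the ten scores in norm_sample are invented, and the function name percent_scoring_below is hypothetical. The function simply reports what percentage of the comparison sample scored lower than a given score.

```python
def percent_scoring_below(score, norm_scores):
    """Percentage of the norm sample with a score strictly below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)


# Invented percent-correct scores for ten "people like Laura".
norm_sample = [35, 40, 45, 50, 50, 55, 60, 65, 70, 90]

laura = 80
print(percent_scoring_below(laura, norm_sample))
```

Against this made-up sample, Laura's 80 percent places her above nine of the ten comparison scores. Against a different sample, the same 80 percent could look quite ordinary, which is why the makeup of the sample matters so much.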

The target population might be public school pupils in the 
United States at a particular age — say ten years old. Some tests 
are standardized on college undergraduates; psychological tests are 
standardized on adults who have some specific, diagnosed mental 
illness. Standardization samples for interest inventories are people 
in various occupations or people with specific hobbies. For a voca- 
tional battery, the standard is frequently the performance of people 
currently filling specific vocational positions. 

The most commonly known standardized tests are the achievement
batteries written for pre-college students. You will be in a
better position to interpret the test results of a pupil in your class- 
room if you have a good conceptualization of the standardization 
procedure for that test. The score from a ten-year-old in your 
classroom is compared to "typical" ten-year-olds. Who makes up 
that "typical" group? 

Standardization procedures vary, of course, but a good publisher 
will try to control for area of country, size of city, and socioeco- 
nomic level. If 13 percent of the ten-year-olds in this country are 
from the Northeast, living in cities of 50,000 to 200,000 population, 
and are white, the publisher will try to select 13 percent of his 
standardization sample from that group. If only 0.5 percent of the
population are black, rural children living in the Northwest, then 
only 0.5 percent of the sample will be chosen with those character- 
istics. Roughly speaking, a publisher tries to sample about 1 percent 
of the students at a given age level. This translates to about 40,000 
students per grade level. The 1 percent figure is frequently opti- 
mistic. For a given nationally standardized test of achievement, 
you can figure the norms for one age or grade level are based on 
between 5,000 and 40,000 pupils. For the norms to be useful to you, 
however, the "typicalness" factor is actually more important than 
is the size of the sample. If all 40,000 ten-year-olds were chosen 
from New York City, the comparisons aren't going to mean much to 
a teacher in Lone Tree, Iowa. A carefully chosen sample of 5,000 
could remedy that. 
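
The proportional matching just described is simple arithmetic, and a rough sketch may help. The 13 percent and 0.5 percent shares are the ones used above; the stratum labels, the lumped "all_other_groups" share, and the function name allocate_sample are assumptions made only for this illustration.

```python
def allocate_sample(population_shares, sample_size):
    """Give each stratum a slice of the sample equal to its population share.

    Rounding to whole pupils can leave the total a pupil or two off
    the requested size when the shares do not divide evenly.
    """
    return {
        stratum: round(share * sample_size)
        for stratum, share in population_shares.items()
    }


# Shares of the ten-year-old population. The first two figures come from
# the text; the third is a placeholder so that the shares sum to 1.
shares = {
    "northeast_midsize_city_white": 0.13,
    "northwest_rural_black": 0.005,
    "all_other_groups": 0.865,
}

for size in (40_000, 5_000):
    print(size, allocate_sample(shares, size))
```

For a 40,000-pupil sample this gives 5,200 and 200 pupils in the two named groups; for a 5,000-pupil sample, 650 and 25. The smaller sample keeps the same proportions, which is why it can be just as "typical" as the larger one.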

In identifying the standardization group, the test publisher does 
not choose 40,000 students directly. Instead, the unit of selection 
is the school district. The school districts are chosen at random to 
conform roughly to the desired sample characteristics. A child is 
only tested if he happens to be in one of the selected school 
districts. 
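
Choosing districts rather than pupils might be sketched as follows. Everything here is hypothetical: the district names, the two strata, the number of districts drawn from each, and the helper names select_districts and is_tested. A real standardization program would involve far more strata and far more districts.

```python
import random


def select_districts(districts_by_stratum, n_per_stratum, seed=None):
    """Draw whole districts at random within each stratum."""
    rng = random.Random(seed)
    chosen = {}
    for stratum, districts in districts_by_stratum.items():
        # Never ask for more districts than the stratum contains.
        k = min(n_per_stratum.get(stratum, 0), len(districts))
        chosen[stratum] = rng.sample(districts, k)
    return chosen


def is_tested(pupil_district, chosen):
    """A pupil is tested only if his district was one of those drawn."""
    return any(pupil_district in picked for picked in chosen.values())


# Hypothetical strata and districts.
districts = {
    "northeast_midsize_city": ["District A", "District B", "District C", "District D"],
    "northwest_rural": ["District E", "District F"],
}
wanted = {"northeast_midsize_city": 2, "northwest_rural": 1}

chosen = select_districts(districts, wanted, seed=1)
print(chosen)
```

Note that the sketch assumes every selected district agrees to take part. When some refuse, the districts actually tested are no longer a random draw from each stratum.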

A significant problem in establishing national norms is gaining
the cooperation of the districts which were "lucky" enough to be 
selected for the standardization sample. Administering a battery of 
tests to students takes valuable time away from the instructional 
program. Someone must monitor the tests. The results of the tests 
will eventually come back to the district, but probably too late for 
use in the instructional program. Usually, the test publisher tries 
to give something in return, such as free scoring for some number 
of years, a reduced rate on future tests, or possibly some sort of 
good public relations. Sometimes cooperation is very difficult to 
obtain. Unfortunately, real bias can creep into the list of those who 
cooperate and those who do not. Such bias can bring the "typical- 
ness" of the sample into question. 

For example, suppose School X is located in an inner city area. 
Year after year, the principal of the school has seen how his pupils 
compared badly with the standardization sample on every test. 
Who needs still another kick in the teeth? School Y, though, in a 
nice suburb of the same city, regularly receives its pat on the head 
with the return of the yearly achievement test results. Don't you 
suppose the second school will be more likely to cooperate? Some- 
times gathering the proper number of minority group schools from 
low socioeconomic areas of large cities is quite difficult. The published
norms will reflect that bias. Because of this, the publisher
usually won't be anxious to make the precise distribution of the 
standardization sample known to all users. 

Why is information about the standardization of a test needed 
by the classroom teacher? The answer centers on the word "typi- 
cal." The score received is a comparison of your student to the 
"typical" student. If your student is markedly "atypical," it fig- 
ures that you shouldn't attach too much significance to the 
comparison. The standard is wrong for your class or for some stu- 
dents in the class — they need a different WWV. Do some of your 
students use English as a second language? The typical student 
certainly doesn't. You'd better find some standard based on stu- 
dents who use English as a second language. To be sure, no student 
is totally typical; but for the majority of schools and students, the 
norms can be used nicely, as the deviation from typical is minor. 

How do standardized tests differ from unstandardized tests? 
Generally speaking, published tests are usually standardized, and 
standardized tests are usually published. The two go together. The 
largest category under the heading of unstandardized tests consists of
those tests made by a teacher for use in a particular school. How 
does this group — the teacher-made tests — differ from the stand- 
ardized tests? 

The differences fall under three general headings: specificity of 
purpose, effort in preparation, and test interpretation procedures. 
Consider these differences under each heading: 

STANDARDIZED TESTS vs. TEACHER-MADE TESTS

1. SPECIFICITY

   Standardized tests: The standardized test is fairly general,
   dealing with a broad area of content. The coverage concentrates
   on those elements of an instructional area which are common to
   large numbers of school districts.

   Teacher-made tests: Teachers write tests which are specific to
   a learning task. Commonality with other classrooms is not
   important.

COMMENT: Test publishers want to sell their tests. That's 
why they are published in the first place. If the test content con- 
forms to the curriculum in a minority of the schools, not many 
sales are possible. Thus, there is a necessary concentration on 
topics which are common to most schools. That's one reason why 
"new math" concepts were not included in standardized tests for 
such a long time. As long as a substantial number of districts
were not basing their programs on the new math concepts, the
publishers had to avoid these kinds of questions. After all, districts 
with the new approach could still use tests without such items, but 
districts which had not adopted the new math would obviously 
show up poorly if the items included extensive use of the new 
terminology. 



2. EFFORT IN PREPARATION

   Standardized tests: Content and measurement experts spend an
   extensive amount of time surveying the field and perfecting the
   test. Generally, these experts try to follow relatively
   scientific test construction procedures.

   Teacher-made tests: Given the press of time from other required
   activities, the teacher can hardly be expected to match the
   effort which goes into a published measure. Probably, the
   majority of these tests will be used only once.

3. TEST INTERPRETATION

   Standardized tests: The raw scores (number right or wrong) are
   rarely reported. Instead, some number showing how a student
   compares to the "typical" student is the basis for
   interpretation. This requires the standardization sample.

   Teacher-made tests: If the test is criterion-referenced, the
   interpretation is on the basis of attained objectives. Tests
   which are not criterion-referenced are usually interpreted
   through within-class comparison. The scores are arrayed in order
   and "curved." Some districts still maintain a percentage scale
   (e.g., 93%-100% is an A).



Published tests, then, are written for areas of general and wide- 
spread interest, involve a lot of preparation effort, and are generally 
interpreted in the comparative or normative sense. How else are 
teacher-made classroom tests different? 

The teacher-made test allows for more flexibility in administration.
For example, the published tests will be difficult to administer
at different times in the same classroom — primarily for test- 
security reasons. The teacher-made test can be altered somewhat, 
as described in chapter 4, so that it can be administered at the time 
each student fulfills the required objectives. It also can be used as 
a take-home exam. Published tests are not designed for this type 
of usage. 

Published tests are usually not criterion-referenced. The test 
manual will describe the general coverage, but this information will 
rarely be in terms of specific behavioral objectives. This rules out a 
criterion-referenced reporting system. Teacher-made tests can be 
criterion-referenced whenever the criteria can be specified. 




When a test writer begins to ply his trade for a test publisher, 
certain artificial parameters must be dealt with. The professional 
test writer must obey rules which the teacher, writing for a few 
students, can ignore. The published test must be easily scored, 
demanding the use of objective items, primarily multiple choice, 
and it cannot have any local idioms. The questions must discrimi- 
nate. Discrimination in the testing world means that the items 
cannot be so easy that everyone gets them correct, nor so difficult 
that no one can answer. The writer of a teacher-made test has no 
such limitations. 
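
The rule about discrimination, as stated here, can be checked mechanically: compute the proportion of students answering each item correctly and flag the items everyone passed or everyone missed. The sketch below follows only that simplified description; professional item analysis uses more elaborate indices. The function names and the small table of right/wrong marks are invented for the example.

```python
def item_difficulty(responses):
    """Proportion of students answering each item correctly.

    `responses` holds one list per student, with 1 for a correct
    answer and 0 for an incorrect one.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        sum(student[i] for student in responses) / n_students
        for i in range(n_items)
    ]


def nondiscriminating_items(responses):
    """Indices of items that everyone got right or everyone got wrong."""
    return [
        i
        for i, p in enumerate(item_difficulty(responses))
        if p == 0.0 or p == 1.0
    ]


# Hypothetical marks for four students on four items.
marks = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
]

print(item_difficulty(marks))
print(nondiscriminating_items(marks))
```

In this made-up data the first item is answered correctly by everyone and the third by no one, so a professional writer would have to drop or rewrite both. A teacher checking mastery of an objective might be delighted with the first.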

All of these limiting factors on the professional writer are artifi- 
cial in the sense that they are not really a part of the major task 
of the test items. The test items are supposed to find out how a 
particular student handles a particular task. Period. Scoring con- 
siderations are artificial, as are format requirements, discrimination 
needs, and common language specifications. 

The point is not that standardized tests are therefore bad and 
teacher-made tests good. The point is that there is no reason for 
teachers to conform to the same set of rules which govern profes- 
sional test writers. Teacher-made tests will probably look different. 
The results will be reported with systems unfamiliar to the world 
of published tests. Remember that you, the teacher, live in an 
unrestricted world as far as test writing is concerned. The profes- 
sional test writer has the restrictions. 



Questions 

1. Find a technical manual which accompanies some standardized test. 
Write a brief report describing the test standardization program. How 
many were in the sample? How were they chosen? Is the test pub- 
lisher evasive about the standardization program? 

2. In what ways do you think the high school you attended was not 
typical of the standardization sample for a nationally standardized 
test? Do you think the difference between your high school and the 
national group is enough to make the national norms invalid in that 
situation? 



How can you find a standardized test to fit a particular purpose? 
Before identifying the specific sources which are best to use when 
seeking a standardized measure, a preliminary point needs to be 
made. Look at the two little "blobs" in figure 4 (mathematicians
call them sets). One includes all the situations where teachers,
administrators, school psychologists, researchers, project directors, or
evaluators might need some sort of measure. The other includes all 
of the published tests available at this time. The question is: Which 
is the subset of which? Are there more published tests than there
are situations to use them? Or are there more measurement
situations than there are published tests?



[Figure 4. Two sets. One: all of the different situations where some
sort of measure is needed. The other: the purposes for which all of
the standardized and published tests have been written.]

The answer is pretty obvious. Consider these measurement situa- 
tions, each of which is realistic: The measurement of French 
grammar achievement for eight-year-old girls with a hearing defect; 
or, the measurement of attitude toward educational psychology in 
a private university. Why, generally, are tests standardized and 
published? Primarily to make a profit. Would a standardized mea- 
sure in these two specific areas be profitable? No. The population 
for potential sales is limited. 

Clearly, far more situations exist where a measure is needed and 
a standardized test is not available. The two important rules to use 
when considering a standardized test are: First, decide what it is 
you are trying to measure, in behavioral terms if possible. Otherwise,
at least know the general direction. Then, look for a standardized
measure. Don't let the complexion of your task be limited by
the available standardized measures. Define the problem, then look 
for the test. If one is available — great! If not, you'll just have to 
develop your own measures. 

As you begin your search for a standardized test, the best starting
points are the Mental Measurements Yearbooks.¹ The first of these
volumes appeared in 1938 and the seventh in 1972. In this series of
books, Buros has tried to include information on as many published
tests in English as he can accumulate. The information in these
volumes is organized first according to obvious categories like
English, intelligence, or vocational tests. General information about
each test is also presented: information such as number of levels,
general coverage, different forms, publisher, and suggested age
ranges. Also included in the description is a partial list of research
projects which have used the particular test as part of the project.
Many of the tests are critiqued, usually by an expert in the test
area. Some of the more commonly used tests have numerous
critiques. The Yearbooks are not completely hierarchical; critiques
which appeared in the sixth yearbook are not usually repeated in
the seventh. If you are looking for critiques of an older, more
established test, you would be wise to check all of the yearbooks
which appeared after the publication date of the test.

1. O. K. Buros, The Seventh Mental Measurements Yearbook, two
volumes (Highland Park, N.J.: Gryphon Press, 1972); O. K. Buros, The
Sixth Mental Measurements Yearbook (Highland Park, N.J.: Gryphon
Press, 1965); O. K. Buros, The Fifth Mental Measurements Yearbook
(Highland Park, N.J.: Gryphon Press, 1959); O. K. Buros, The Fourth
Mental Measurements Yearbook (Highland Park, N.J.: Gryphon Press,
1953); O. K. Buros, The Third Mental Measurements Yearbook (New
Brunswick, N.J.: Rutgers University Press, 1949); O. K. Buros, The
1940 Mental Measurements Yearbook (Highland Park, N.J.: Gryphon
Press, 1941); O. K. Buros, The 1938 Mental Measurements Yearbook
(New Brunswick, N.J.: Rutgers University Press, 1938).

A second major source of test information includes the publishers 
themselves. The Mental Measurements Yearbooks do not, of 
course, contain a copy of each test. If you identify a test in the 
Yearbook which might be what you are looking for, the next step is 
to write to the publisher for a copy of the test. Sometimes they will 
send these free; at other times you may have to buy a specimen 
set. A specimen set, which can be purchased for about a dollar, 
contains a copy of the test, an answer sheet, a scoring key, and 
descriptive information about the test. The addresses of publishers 
can be found in The Mental Measurements Yearbooks. 



The World of Standardized Tests: A Summary Outline 

Most educators do not have extensive personal experience with the 
world of published tests. Most personal experience has been as a 
test-taker and not as a test-chooser, test-administrator, or
test-interpreter. To bring the entire field of published tests into focus
for you at one time, a comprehensive outline will be developed in
this section. The skeleton outline below is followed by an elabora- 
tion of each item. 

THE WORLD OF STANDARDIZED MEASURES

I. MAXIMUM PERFORMANCE TESTS
   A. Achievement vs. Aptitude: An operational distinction
   B. Diagnostic Uses
      1. Readiness: Is a student ready for some experience?
      2. Mastery: Which concepts require more learning?
      3. Problem Identification: What are specific problems
         of students who are not "making it"?
   C. Survey Uses
      1. Achievement Batteries
      2. Scholastic Aptitude Tests
      3. Vocational Batteries
   D. Single Purpose
      1. Subject Area Specific Tests
      2. Specific Vocational Measures

II. TYPICAL PERFORMANCE MEASURES
   A. Measures of Vocational Interest
   B. Measures of Attitude
   C. Measures of Personality Constructs

III. ITEM FORMATS IN STANDARDIZED MEASURES
   A. Stimulus Modes
   B. Response Modes
   C. The Interaction of Stimulus/Response Modes and the
      Student

The title includes the word "measures" rather than "tests." The 
rationale for the choice of "measures" also underlies the distinction 
between major heading I (Maximum performance tests) and II
(Typical performance measures). A test is a measure, but a mea- 
sure need not be a test. When you administer an achievement test 
to a student, your expectation is that the student will do his very 
best. The maximum number correct will be obtained with maximum 
effort. You expect that the student will not try to get a certain 
number wrong. 

On the other hand, suppose you wanted to find out the amount 
of interest each student has in becoming a policeman. You are not 
asking, "How much interest could each student possibly have if he 
really tried," but instead, "What is the usual or typical amount of 
interest each student has." You would not really be testing to check 
on the maximum performance. You would, instead, be measuring
to determine the usual level of interest. 

The maximum performance tests, as the outline shows, center
mostly on the cognitive domain. A few of the tests are in the
psychomotor area. Typical performance measures are primarily in
the affective areas and include topics like interest, attitude, and 
personality. 

User's Hint No. 1: For one reason or another, some students do 
not give their "maximum" on a maximum performance test. If you 
suspect this is true, interpret the results with care. 

When would a student do less than the best job possible? One 
reason might be lack of experience. A student who is less familiar 
than his peers with things like item formats, separate answer
sheets, or long periods of testing will be less successful than he 
ought to be. The test situation gets in the way of this student 
showing his true relative position. You will receive inaccurate in- 
formation about his true performance level. 

Maximum performance demands a high level of motivation. Tak- 
ing a test is about as much fun as a tetanus shot. Something like 
thirty minutes to two hours of concentrated effort (mental and
physical) is not usually the sort of relaxation one seeks out. But
all of your students will need motivation. Doesn't everyone have 
the same problem? 

Suppose you develop your own answer to the question by reading 
about these pairs of students: 

Clint and Clyde are both in seventh grade. Ever since first grade, 
Clint has been a consistently high performer. He usually ranks at 
about the ninety-fifth percentile on achievement batteries. His 
name gets mentioned whenever the "brain-o's" are discussed by 
classmates. Clyde is an average performer. He's a likeable guy and 
gets a lot of personal satisfaction out of social interaction. Gen- 
erally, his test scores range in the middle percentiles. Question: 
Who, in the past, has been most often rewarded for working hard 
during intensive testing periods? Who will then work harder in the 
future? 

Laura's father is a professional man. He has a master's degree
from a prominent university from which Laura's mother also gradu- 
ated. Achievement is important in this family — at the office, in the 
neighborhood, and even in the family's athletic outlets. Laura's 
school performance is a frequent dinner topic. Plans for her enroll- 
ment at a university had been made almost at birth. Lynn's family 
is equally well-off financially; however, the family is far less aca- 
demically oriented. Her father and mother met in high school and 
neither went further. The family money is earned from a small 
ventilation contracting firm. Lynn's possible college attendance has 
not been discussed. Question: Which of these two has more of a 
reason for maximum performance in the test situation? 




Tom's grandfather came to the United States from Europe at age 
twenty. He never graduated from any school, and one of his goals 
in life was that his children would do so. Tom's father did complete 
high school and attended a business school after graduation. The 
grandfather's hope was realized. The added schooling helped Tom's 
father have a far easier life. It is not surprising that Tom's father 
feels the same way about education for Tom who will go to college 
— maybe even graduate school. For many generations, immigrants 
have found that learning helps one advance in the social strata. 
Tim's line of ancestors goes back to a boat from Africa in the early 
1800s. His grandfather was a great baseball player who, unfortu- 
nately, preceded Jackie Robinson. His father was a college gradu- 
ate, but the degree never helped him get more than laborer jobs. 
An uncle was a teacher who never seemed to get the choice assign- 
ment or advancements. For many generations, immigrants from 
Africa have found education to be of minimal help in the fight for 
social and economic advancement. Question: Which of these two 
will see more long-term rewards in high level effort on the maxi- 
mum performance test? 

Clearly, the levels of motivation in each pair are not the same. 
When you are administering a standardized test to the students in 
your classroom, be sensitive to these differences. Try to find a way 
to motivate all equally — you know better than anyone what re- 
wards will be most effective. Be cautious of interpretations in cases 
where you think grossly different motivational levels are involved. 



Questions 

3. Each of the three pairs of students described centers on a somewhat 
different variable. What are the variables — that is, the bases for the 
primary differentiation in each of the pairs? 

4. Within which of the three pairs (Clyde-Clint, Laura-Lynn, and 
Tom-Tim) do you think the motivational difference to do well in a 
test will be the greatest? Why? 



User's Hint No. 2: Students have been known to give their 
maximum when you ask for typical performance. 

Professor Simple gives a course evaluation sheet on Monday. The 
students are asked to be quite candid, and they are also asked to 
sign the sheets. Wednesday is the deadline for the final paper in the
course, and Thursday is the final exam. In other words, Professor 
Simple will have the course evaluations in his hands as he assigns 
the course grades. Will all students give a typical course evaluation? 
Will they respond in a manner to reflect their usual feelings about 
the course? Surely, the intimidation will cause some to frame the 
course evaluations in the best possible light. 

A job applicant at IBM would undoubtedly try to show the 
maximum possible level of interest in computers — rather than his 
typical level. A psychiatric patient would want to show the maxi- 
mum amount of normality on some personality measure, rather 
than respond with his typical level of paranoia. 

Be cautious when interpreting typical performance measure re- 
sults. Only in situations where the test-taker sincerely wants to 
show you his usual or normal feelings are the results meaningful. 
Generally speaking, if the test-taker wants to fake a typical per- 
formance measure, the task can be accomplished. 



Achievement vs. Aptitude: An Operational Distinction 

The first distinction which must be made under the maximum 
performance test heading is between achievement and aptitude 
tests. Traditionally, achievement tests have been defined as mea- 
suring what the student has learned up to the moment of testing. 
Aptitude tests, on the other hand, were said to measure the stu- 
dent's ability to learn in the future. The traditional definitions 
imply a clear distinction between the two kinds of tests. The author 
believes the traditional definitions are unhappy ones. An opera- 
tional definition, based on test use, is preferred. This distinction is 
as follows: 

If you administer a test and use the results to determine the 
student's current performance level, you have used it as an achieve- 
ment test. If you use the results to make some sort of decision 
about the student's future (assign him to a reading group or en- 
courage him to enter a college), you have used it as an aptitude 
test. 

Achievement and aptitude tests are not all that different. A study 
of the actual test would show that in format and content they are 
quite similar. This similarity should not be too surprising, for how
could anyone possibly predict what a student can do or learn with- 
out measuring what he has done or learned? You must measure 
current achievement level to predict future performance level. If a 
student has achieved nothing, you will be forced to predict he never 
will achieve anything. In other words, a test title which includes 
words like "aptitude" or "intelligence" is really a measure of cur- 
rent performance level, as are tests with "achievement" in the title. 
Tests with the word "achievement" in the title can be used to 
measure current performance or predict future learning. Tests with 
the words "aptitude" or "intelligence" in the title can also be used 
to measure current performance or future learning. 

Don't let the title of the test tell you if it is an achievement or 
aptitude test. First see how the results are used. Then make the 
distinction based on use. 



Question 

5. A sixth-grade teacher receives the results of a standardized mathe- 
matics test for the students in the room. 

a. Name one use the teacher could make of the results which clearly 
makes the test an achievement test for that purpose. 

b. Name two uses the teacher could make of the results which clearly 
make the test an aptitude test for those purposes. 



Diagnostic Uses: The Readiness Tests 

Recently, the author was notified that he had "won" four days free 
food and lodging at a resort in Nevada. Unfortunately, air fare 
from home (Chicago) to Nevada was not included. Plane fare was 
an important prerequisite for enjoying the "prize." A student 
teacher at a nearby middle school recently told of a student who 
was being offered a "prize" to learn elementary algebra in eighth 
grade. Unfortunately, the student had not quite mastered certain 
incidental math skills like the multiplication tables and long 
division. In an analogous manner, he also was short "plane fare." 

If (a) a learning task requires certain requisite skills; (b) you 
are capable of defining those skills; (c) the skills are amenable to
measurement; (d) the measurements show that a particular stu- 
dent does not possess the skills; then (e) it seems ridiculous to start 
the student on the task. Surely, attempting a task for a year and 
failing is far worse than never having attempted it at all. A first
failure makes the second attempt at learning needlessly hard.

Readiness tests are designed to help avoid unnecessary early 
failures. The most prominent of the readiness tests are those which 
are directed toward reading. The law requires that all children 
start school at age five or six. Some children are ready to read at 
three. Others are not ready at six. "Ready" means only that the 
child has the requisite skills so that he is at least capable of learn- 
ing. What are the skills? 



Questions 

6. Name at least three skills that a child must have before he can be 
expected to learn to read. 

7. For each skill named in the question above, think of some method 
whereby the presence of the skill in a child could be measured. 

(Do these two questions before moving on to the next paragraph.) 



Learning to read becomes much easier if the child is capable of 
distinguishing among various sounds made by the teacher. For 
example, the long and short "a" or "o" sounds are similar enough 
to require fairly good auditory discrimination in order to distin- 
guish between them. How would you measure this? Well, you might 
say two sounds, one at a time, and ask the child to respond "same" 
or "different," sometimes saying the same sounds and sometimes 
giving different ones. 

Even more important to reading is visual discrimination. Any 
child who looks at "b" and "d" and sees no difference, or thinks 
"man," "mam," "nan," and "nam" are all the same is in big 
trouble with the first reader. The prognosis for such a child's
reading success will be pessimistic. How can you measure visual
discrimination? Show, in pairs and triplets, letters and letter combinations
which most frequently cause problems. In some questions the child
will be asked to mark the one (if any) which is different. Other
times, the task is changed to require the selection of two letters or
letter combinations which are alike from among four possibilities.

Auditory and visual discrimination are absolutely necessary. 
Some other aspects are also helpful. For example, a certain amount 
of verbal comprehension is required. The beginning readers usually 
link pictures to words. If a child cannot link "man" or "boat" or 
"crying" to the appropriate pictures of these three, then the next 
step of linking the printed word to the concept is going to be tricky. 
Of course, the child is a little ahead of the game if his elders have 
hammered the alphabet into his head, so that "E" and "e" bring 
about the verbal response of "ee." A peripheral necessity is hand- 
eye coordination. Many of the worksheets the beginner will be 
doing require a certain amount of copying, circling, x-ing, and 
drawing. While completing worksheets is not a necessity to reading, 
they do make teacher and parent happy and that usually leads to 
making the child happy. 

A variety of reading readiness tests are available to measure 
these skills. The Mental Measurements Yearbooks are the best 
sources of information about these tests. If you are on a test selec- 
tion committee, choose three or four likely tests from the Year- 
books, then write the publishers for specimen sets. Most teachers, 
however, are not test selectors, but instead are result interpreters. 

User's Hint No. 3: Don't view a child's test as a single general 
measure of reading readiness. The test will have different specific 
sections. Look at each section specifically. Find out which requi- 
sites the child has and which are missing. Treat the missing skills 
before you start the reading instruction. There's something criminal 
about starting a child on a task when you know he cannot possibly 
succeed. 
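
Reading a readiness test section by section, rather than as one total, might look something like the sketch below. The section names follow the skills discussed above, but the scores, the cutoffs, and the function name missing_requisites are purely hypothetical; a real test manual would supply its own sections and its own standards.

```python
def missing_requisites(section_scores, cutoffs):
    """List the sections on which the child falls below the cutoff."""
    return [
        section
        for section, cutoff in cutoffs.items()
        if section_scores.get(section, 0) < cutoff
    ]


# Hypothetical cutoffs on five ten-item sections.
cutoffs = {
    "auditory_discrimination": 7,
    "visual_discrimination": 7,
    "verbal_comprehension": 6,
    "alphabet_knowledge": 5,
    "hand_eye_coordination": 5,
}

# Hypothetical scores for one child.
child = {
    "auditory_discrimination": 9,
    "visual_discrimination": 4,
    "verbal_comprehension": 8,
    "alphabet_knowledge": 9,
    "hand_eye_coordination": 8,
}

print(sum(child.values()))                 # the single "general" score
print(missing_requisites(child, cutoffs))  # the section-by-section view
```

This invented child scores 38 out of 50 overall, which looks respectable, yet falls short on visual discrimination. The total hides exactly the information the teacher needs before reading instruction begins.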

Other readiness tests are geared for older students. A number of 
algebra readiness tests have been developed to help administrators 
place students in either algebra or some other math course. The 
military uses Morse code aptitude tests. Prospective flyers can take 
a spatial perception aptitude test. Vocational placement batteries
often include typing, keypunch, or switchboard aptitude measures. 
All of these are primarily readiness tests. Each measures to see if 
the potential learner has the necessary skills for the eventual learn- 
ing of the task. 




Questions 

8. Harold takes the XYZ Reading Readiness test which says he is 
ready to read. At the end of first grade, after a full year of instruc- 
tion, he has not learned to read. Speculate on some possible reasons. 

9. Suppose you were to construct an algebra aptitude test. What 
requisite skills would you measure? Try to construct some sample 
items. 

10. Based on the distinction given in the previous section, should 
readiness tests be categorized as achievement or aptitude measures? 

11. As a special project, construct a brief reading readiness test which 
contains 

a. ten auditory discrimination items; 

b. ten visual discrimination items; and 

c. ten items linking pictures to words. (The child sees the picture 
and you say three words, one of which is correct.) 



Diagnostic Uses: Mastery Tests 

A readiness test would be administered to all students prior to 
programming them into a learning task. Diagnostic mastery tests 
would also be administered to all students, but the emphasis would 
be on the question, "Which topics need further attention with each 
individual?" rather than, "Is this individual capable of success in 
this learning program?" A sixth-grade arithmetic mastery test 
might include sections covering factors, fractions, decimals, geome- 
try, ratio and proportion, graphs, and measurement concepts. Possi- 
bly at the beginning of the year, the teacher would administer this 
battery to all students. The results would help the teacher plan an 
individualized program of instruction. 

Mastery tests designed to diagnose specific areas of weakness 
should really be criterion-referenced. Although a few such tests are 
available (and can be located by checking the appropriate subject 
matter titles in Mental Measurements Yearbooks), published 
criterion-referenced diagnostic tests for normal students are not 
numerous. Hopefully, with the increased emphasis on behavioral 
objectives and criterion-referenced tests, this paucity will evapo- 
rate. Specific topics in the physical and biological sciences, social 
studies areas, and language arts topics all could use comprehensive 
criterion-referenced diagnostic batteries for normal students. 




Diagnostic Uses: Problem Identification 

Readiness tests are designed for administration to all students prior 
to entering an instructional program. Diagnostic mastery batteries, 
once again, are to be administered to all to determine which specific 
topics need additional effort. A wide variety of published diagnostic 
tests have been developed for the student who has attempted some 
learning task without experiencing success. Such tests are usually 
not administered to all students. Generally, a diagnostic test of this 
type would only be administered when a student has an unexplain- 
able and unexpected problem completing some learning task. 

This type of diagnostic test will differ from the more common 
achievement batteries in fairly predictable ways. The achievement 
batteries are designed for all students. The problem-identification 
diagnostic tests are designed primarily for students who have not 
experienced success in attempts at some learning task. This second 
group will not include the higher academic performers, so the items 
in a problem-identification diagnostic test will tend to be less difficult.
The achievement battery is designed to sample pupil performance generally
in a broadly defined area. It doesn't get too specific. The problem-identification test is,
by its own nature, very specific and covers particular conceptual 
breakdowns. The diagnostic test should provide a variety of sub- 
scores, where each subscore is based on a substantial number of 
items. The general achievement test will rarely have more than one 
or two items directed toward any specific objective. 

The potential market for an elementary school achievement 
battery is four to five million students per grade level. A diagnostic 
test attuned to some specific learning disability will have a potential
market only fractionally as large. Given the profit motive of
most publishing houses, it is not surprising that achievement bat- 
teries seek out the market, but the educator must actively seek a 
problem-oriented diagnostic test. To find such a test, the best 
starting place, once again, is Buros's handbooks. However, many 
tests not included in the handbook are available from nonprofit 
organizations. School districts, universities, and governmental agen- 
cies operate centers designed for specific remedial purposes. A 
speech, learning disabilities, reading, or hearing clinic may be lo- 
cated near you. Such centers usually are repositories for diagnostic 
tests developed by others. Usually, the staff of these centers has 
developed testing sequences of its own. If you have need for a 
diagnostic test for some specific purpose and cannot locate one
through Buros's handbook, your next move is to contact the pro- 
fessional staff at an appropriate remedial center. 

User's Hint No. 4: The subscores in a diagnostic test are based 
on more items than would be found for a specific objective in a 
survey battery, but they still tend to be somewhat unreliable. Don't 
make any earth-shatteringly important decisions about a kid based 
on the score from one subtest. Try to obtain evidence from other 
sources as well. The subscore is just one author's attempt to diag- 
nose a problem. Use the subtest scores to identify problem areas 
for further study and thought. If your considered personal percep- 
tions are in distinct disharmony with the diagnostic test results, be 
sure to pursue the discrepancy until you are in a position to dis- 
credit one or the other. Don't take a test score as an absolute truth 
at the expense of your own professional opinion. 
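
The caution in User's Hint No. 4 can be made concrete with a small sketch. It borrows the Spearman-Brown formula, a standard psychometric result that this section does not itself introduce, to estimate how reliability grows as a cluster of items is lengthened. The starting reliability of 0.30 and the item counts are assumed purely for illustration; they describe no real test.

```python
def spearman_brown(reliability, length_factor):
    """Estimated reliability when a test is lengthened by `length_factor`
    using comparable items (standard Spearman-Brown prophecy formula)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Assumed figures: a two-item cluster in a survey battery with reliability 0.30,
# compared with a ten-item diagnostic subtest (five times as long).
two_item_reliability = 0.30
ten_item_reliability = spearman_brown(two_item_reliability, 5)

print(round(ten_item_reliability, 2))
```

Under these assumed numbers the longer subtest is considerably more dependable than the two-item cluster, yet it still falls well short of the level one would want before making an important decision about a child from that score alone.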

Survey Uses: The Achievement Batteries 

A battery of tests will include a measure in most of the instruc- 
tional areas of a school's curriculum. A typical elementary school 
battery, for example, contains somewhere around ten subtests cov- 
ering the four major curriculum areas of reading, language arts, 
arithmetic, and study skills. For example, the Iowa Tests of Basic 
Skills² contains the following subtests:

TEST V Vocabulary 

TEST R Reading Comprehension 

TEST L Language Skills 
L-1 Spelling 
L-2 Capitalization 
L-3 Punctuation 
L-4 Usage 

TEST W Work-Study Habits 
W-1 Map Reading 
W-2 Reading Graphs and Tables 
W-3 Knowledge and Use of Reference Materials 

TEST M Mathematics Skills
M-1 Mathematics Concepts
M-2 Mathematics Problem Solving

The Iowa Tests are fairly typical of the five or six major achieve- 
ment batteries used in elementary schools. 

2. A. N. Hieronymus and E. F. Lindquist, Iowa Tests of Basic Skills 
(Boston: Houghton Mifflin Co., 1971). 




Why use a battery? Why not just go buy a variety of achieve- 
ment tests? The answer goes back to the clock problem introduced 
at the beginning of this chapter. If each state had a station WWV 
(e.g. Minnesota had WMIN, Illinois had WILL, and California had 
WCAL), clocks would still differ because WMIN, WILL, and
WCAL may not be precisely in harmony. Likewise, if the Vocabu- 
lary Test is normed on one sample of 40,000, the Reading Compre- 
hension Test on a second random sample of 40,000, and the Spelling 
Test on a new group, the comparisons would be difficult. If a 
student scores at the fiftieth percentile in vocabulary, the fiftieth 
in reading, and the sixtieth in spelling, each comparison would be 
to a different group. Is he a relatively higher performer in spelling 
than reading and vocabulary, and approximately equal in the last 
two? Remember that the standardization sample contains only a 
relatively small proportion of all students in the country. Maybe by 
chance, the vocabulary sample contained a lot of high performers — 
more than the reading sample. Maybe all schools on the West coast 
refused to participate in the standardization of the spelling test. 
The point is, if you don't have the same standardization sample, 
relative comparisons are fairly meaningless. A strength of the ele- 
mentary school batteries is the use of the same standardization 
sample for all subtests.
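
The effect of norming on different samples can be sketched in a few lines. In the example below, the two norm samples and the raw score are made-up numbers, not data from any published test; the point is only that the same raw score earns a different percentile rank depending on who happened to be in the standardization sample.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(score, norm_sample):
    """Percent of the norm sample scoring below `score`, counting half of any ties."""
    ordered = sorted(norm_sample)
    below = bisect_left(ordered, score)
    ties = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)

# Hypothetical norm samples of raw scores (illustrative only)
typical_sample = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
stronger_sample = [16, 18, 20, 22, 24, 26, 28, 30, 32, 34]  # more high performers

raw_score = 22
print(percentile_rank(raw_score, typical_sample))   # 65.0
print(percentile_rank(raw_score, stronger_sample))  # 35.0
```

The student has not changed between the two lines of output; only the comparison group has. When every subtest in a battery shares one standardization sample, a difference between two of a student's percentile ranks at least refers to the same group of students.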

As you can see from the list of subtests, the Iowa Battery con- 
centrates on skills and not content or information. Each skill is 
introduced in the test at about the same time most school districts 
introduce the skill into the instructional program. Some districts, 
however, do not conform to the pattern of the majority. If a child 
is tested on a skill before it is introduced to him, the result will be 
predictable. This is not to suggest that a district should make its 
instructional program conform to the sequencing chosen by test 
developers. Common sense in interpretation is suggested, however. 
In fact, consider carefully the following hint. 

User's Hint No. 5: Before you try to use the results of any 
standardized achievement test with your students, study (not just
look at) the content of the test very carefully. 

A careful study of the test content will tell you which skills you 
can expect your students to have and which have not yet been 
introduced. 

A few elementary achievement batteries include subtests over 
specific content areas such as science and social studies. Be sure to 
differentiate between a test of basic skills and a test over a content 
area. The basic skills should develop in the student almost regardless
of the particular books, techniques, or sequencing in the district.
The content is very much a function of the book or series
used. One social studies series may be law-focused, whereas a 
second will stress economics. The information presented might be 
substantially different. User's Hint No. 5 should be applied espe- 
cially if your students are taking a content area subtest. You really 
can't expect them to know things which haven't even been pre- 
sented yet. 

A survey achievement battery is somewhat analogous to an 
annual twenty-minute health check-up by a physician. Both are 
superficial. Neither is particularly diagnostic. If the physician finds 
some questionable symptom in the check-up, further specific and 
in-depth examinations will be prescribed. The achievement battery, 
if used properly, can also identify questionable symptoms. Given 
peculiar achievement results, the teacher should try to find the 
problem, rather than just treating the symptom. Mary scores mark- 
edly lower in arithmetic problem solving than in paragraph mean- 
ing; John's relative performance has slumped sharply from fourth 
to fifth grade; Clint's test performance is out of tune with your 
perceptions of him from his classroom performance. Why? The 
battery has simply identified the problem areas. Now the teacher 
should do deeper probing. This could be done with a more exhaus- 
tive achievement test or a home-made test. Perhaps systematic 
observations of these three students would be in order. 

For instructional purposes, the best time to administer the bat- 
tery is in early fall. This scheduling allows the results to be used to 
plan each student's program for the rest of the year. Regardless 
of when the test is given, the results must be returned quickly to be 
of any value. Frequently, teachers must wait four to eight weeks 
after the testing to see the results. This detracts seriously from the 
test's usefulness. If a scoring service cannot provide two-week 
service, the author suggests that the district seek another, or use a 
local data-processing center. Admittedly, most teachers are not 
responsible for setting the test dates or choosing the scoring service, 
but administrators often respond to agitation from the teaching 
staff. Tests administered in mid-September and results available in 
early October would be of the greatest value to the teacher. 

High school achievement batteries are not used as extensively as 
are those in the elementary school. Whereas a nine-year-old student 
in almost any district in the country is dealing with skills in reading,
language arts, arithmetic, and study skills, program differentiation
is common at the high school level. Some students follow a
college-prep sequence; others stick to vocational courses. One state
requires U.S. History in the junior year while another places this
course at the freshman level.

A sharp dichotomy is impossible, but two philosophies tend to 
permeate high school survey batteries. Extension of the basic skills 
concept to higher levels of cognition dominates one school of 
thought. The Iowa Test of Educational Development (ITED) 
characterizes this group. Subtests in the ITED have titles which 
indicate the stress on interpretation, understanding, and use of 
information in context. Rather than concentrating on the content 
of any particular instructional program, these tests focus on the 
more general thinking processes. As you might predict, the reading 
load in certain of the subtests is pretty heavy. Thus, a student who 
seems to have good thought processes but poor reading skills will 
not do well on the test. 

The second approach to a high school achievement battery 
centers on the content areas of the high school curriculum. The 
Tests of Academic Progress³ are illustrative of this philosophy. The
subtests in the TAP include the traditional subject areas of social 
studies, composition, science, reading, mathematics and literature. 

Pay particular attention to User's Hint No. 5 if your school uses 
a content-specific battery. The discussion following the hint be- 
comes particularly important again. The subtest in social studies 
only samples the entire field. The curriculum materials in your 
school may not have the same orientation as those upon which the 
test was based. Then also, information in some areas, particularly 
science, tends to become out-dated. The period between initial item 
construction and appearance of the item on the published test can 
range as long as five years. In a fast moving field, five years is long 
enough to date an item. 

Despite the heavy reading load, the group of tests which centers 
on interpretation, understanding, and use of information does at- 
tempt to measure the higher level objectives of most schools. They 
are, in a sense, an outside measure of some of the objectives discussed
in chapter 2, namely those which are so difficult to define
behaviorally. The content-specific tests are more useful diagnos- 
tically with individual students, but care must be taken to insure 
that the content of the battery is appropriate for the students in a 
particular school. 



3. Dale P. Scannell, Tests of Academic Progress (Boston: Houghton 
Mifflin Co., 1971). 




Survey Uses: Scholastic Aptitude Tests 

As the name implies, scholastic aptitude tests are used to make 
decisions about a student's future. Often they masquerade under 
some inappropriate titles, such as IQ tests, intelligence tests, or 
tests of general intellectual functioning. These other names too 
often conjure up visions of more than the test is capable of provid- 
ing. A scholastic aptitude test score can be used to make statements 
about a student's probability of future success in school-like tasks. 
Barring some sort of intervention program (change in home or 
school environment, change in motivation, or some other relatively 
major reordering of the student's life), the predictions are fairly 
accurate. 

The classroom teacher's understanding of the limited scope of 
scholastic aptitude tests is of critical importance. Look up from the 
page before going on to the next paragraph and try to think of at 
least two factors which are (a) important for success in life but 
(b) not measured in a typical scholastic aptitude test. 

User's Hint No. 6: A scholastic aptitude test does not sample the 
following important traits: performance on manual tasks, manual 
dexterity, vocational skills, creativity, motivation, personal char- 
acteristics (such as honesty, neatness, integrity, cunning, prompt- 
ness, or orderliness), or anything about the person's social skills. 

Each of the above is important for success in life. A scholastic 
aptitude test measures none of them. Do you see now why the term 
"scholastic aptitude test," implying a limited aptitude, is preferable 
to "IQ" or "intelligence test" which incorrectly implies to many 
teachers a much broader concept? 

Most of the scholastic aptitude scores teachers will see on a 
student's record are from group tests. The original intelligence tests 
were based on individualized measures. The most frequently used 
individualized tests today are the Stanford-Binet Intelligence 
Scale, the Illinois Test of Psycholinguistic Abilities, and the 
Wechsler Intelligence Scale. The differences between group and 
individualized tests are both obvious and important. In a group 
test, the stimulus for the items is a printed page. The test adminis- 
trator has little actual interaction with the people taking the test. 
Obviously, the group tests have a high reading component, which 
is one important reason for the high correlation between scholastic 
aptitude test scores and school achievement. The individualized 
test requires much interaction between the tester, who must be 
highly trained in the skill, and the person being tested. Little
reading is involved. Of course, the high amount of interaction can 
have a possible biasing effect also, such as when an older woman 
tests a male adolescent or when race, ethnicity, or socioeconomic 
status of the two differs sharply. The individualized measures are 
clearly better than the group tests. They are also somewhere 
around 50 to 100 times more expensive to administer, not to men- 
tion very time consuming. 

Publishers of group tests point out the high amount of agreement 
between scores of these tests and scores on one of the individualized 
measures. The correlations are high, approaching the reliability of 
either test. But this high agreement tends to mask certain things 
the individualized tests can do which the group tests cannot. For 
one thing, the group tests are at their best at mid-range. Mid-range 
can be defined as roughly from 85 to 115 or possibly 80 to 120. 
Beyond those limits the actual score from a group test is unstable. 
Thus, if any decision of importance is to be made about a child 
based on scores beyond those limits, the decision should be based 
on an individualized and not a group test.

A scholastic aptitude test contains some items which appear to 
be content-oriented and which differ little from an achievement 
test. "Fill in the oval under the picture that shows the bird" is a 
typical primary test question, where three or four pictures of ani- 
mals are shown, one of which is a bird. Some items are based on 
concepts, again similar to achievement tests, such as the concept of 
"on." "Fill in the oval under the picture that shows the hat on the 
table." For both items, of course, the test administrator would 
be reading the questions to the students. Categorization skills are 
measured by showing three pictures from one category and one 
from another. The task for the student is to cross out the pic- 
ture which is different. For example, the concept of "warm" or 
"inside" could be measured in this manner. Finally, quantitative 
concepts are measured. "Find the box that has three sets of sticks 
in it." 

Tests developed for older students are not read to the group by 
the test administrator. Items like the following are common: 

(Vocabulary) NUISANCE a. friend b. unique c. bother d. problem

(Sentence Completion) For an office so tremendously productive,
we are surprisingly ______ in decorum.
a. proficient b. low c. efficient d. lax






(Verbal Classification) TIN COPPER SILVER 

a. alloy b. compound c. element d. mixtures 

(Verbal Analogy) Flower is to zinnia as fish is to 
a. hook b. water c. salmon d. whale 

In addition, figure analogies (as shown in figure 5) and number 
classification are included. 




Figure 5. Figure Analogy

(Number Series) 7 9 13 19 ____
a. 25 b. 27 c. 29 d. 32



One of the major sources for homogeneous grouping of students 
in the classroom (sometimes called ability grouping) is scholastic 
aptitude test results. The practice is controversial and frequently 
criticized.⁴ Because of the low correlation between achievement and
scholastic aptitude scores, the range of performance even in 
grouped classes is still quite broad. Research has not supported 
the major premise of this practice; namely, that achievement can 
be enhanced by grouping students. The possibility of cultural bias 
in the establishment of these groups is very real also. 

The classroom teacher can use the scholastic aptitude measure 
as a general indicator of the student's potential, bearing always in
mind the cautions listed below. Probably of greatest interest would
be an under-achiever, a child who seems to achieve far below what
the test indicates is his scholastic potential.

4. For an extensive statement on this question, see chapters 5 and 11 of
John W. Wick and Donald L. Beggs, Evaluation for Decision-Making in the
Schools (Boston: Houghton Mifflin Co., 1971).

In interpreting scholastic aptitude scores, these cautions must be 
considered: 

1. The test measures only the kinds of concepts illustrated 
earlier. Reread User's Hint No. 6 for those things not included. The 
score is not a measure of personal worth. 

2. Because of the high reading and arithmetic dependency, a low 
achiever in either area will appear to have a low scholastic potential. 
An individual test would be a better indicator of learning potential 
for such a student, since the reading load would be sharply reduced. 

3. Cultural and environmental differences affect scores. A limited 
background of experiences is clearly a handicap in scholastic apti- 
tude tests. If the student's home environment differs sharply from 
the norm, interpret the scores very carefully. The test is based on 
learned skills. If the child's environment has not provided the 
opportunities to learn the skills, this does not mean the child could 
not learn them. The test score will predict that his learning poten- 
tial is poor. Don't take this interpretation in the case of children 
who have not had much opportunity. 

4. Motivation and effort are necessary for reliable scores. Con- 
sider the three pairs of students described earlier in this chapter. 

5. Think of a test result in terms of a range of scores rather than 
as a specific score. For example, if the cumulative folder lists the 
child's scholastic aptitude as 110, think of this as being somewhere 
in the range of 100 to 120. The scores are estimates, not precise 
statements. A twenty-point range, from 100 to 120, may not
be the best for every test. Actually, the range to use is determined
by a measure called the standard error of measurement.
This concept is discussed in chapter 9.
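
As a rough preview of caution 5, the sketch below computes a score band from the usual textbook definition of the standard error of measurement, the standard deviation times the square root of one minus the reliability. The standard deviation of 15 and the reliability of 0.91 are assumed values chosen for illustration; a real test's manual would supply its own figures, and the band width is a matter of judgment.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Usual definition: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def score_band(observed, sd, reliability, n_sem=1.0):
    """Range of `n_sem` standard errors on either side of the observed score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - n_sem * sem, observed + n_sem * sem

# Assumed values, for illustration only
sd = 15.0
reliability = 0.91

sem = standard_error_of_measurement(sd, reliability)
low, high = score_band(110, sd, reliability, n_sem=2)
print(round(sem, 1), round(low), round(high))
```

With these assumed figures, a band of two standard errors around a recorded score of 110 comes out close to the 100-to-120 range suggested above. A less reliable test would call for a wider band.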



Questions 

12. Find a test which has been published on the topic of Social Studies. 
Separate the items into three categories: 

a. Items which require knowledge of specific information where the 
information appears to be of much general importance to stu- 
dents; 

b. Items which require specific information but where you feel the 
information is not of critical importance; and

c. Items which reflect study or thinking skills that you may think 
any social studies program should instill. 

13. Based on the results of your test analysis for question 12, give 
examples of satisfactory and unsatisfactory uses of the test.

14. Find a scholastic aptitude test which is designed for use at the 
elementary school level. Categorize the items under five headings 
given in the preceding section. 

15. Think back to a course in which you were enrolled in high school. 
Identify from that course a classmate who probably did not test 
particularly high on scholastic aptitude tests but who did very well 
in the class. What traits did that person have which were not mea- 
sured in the scholastic aptitude test? 

16. Think of some adult who is well known to you who probably tested 
very high on scholastic aptitude tests but who has not (by your 
standards) experienced very much success in his or her life. What 
are the characteristics of this person which led to the lack of suc- 
cess? Were they part of the scholastic aptitude test? 



Survey Uses: Vocational Batteries 

Vocational measures can be placed in two general categories. Of 
most interest to teachers are the batteries used by counselors to 
obtain information about a student's relative performance on mea- 
sures of mechanical and abstract reasoning, clerical speed and ac- 
curacy, space relations, and the more common cognitive measures 
(verbal, numerical, language). The Differential Aptitude Test is an 
example in this category. Measures in the second category are 
geared toward specific vocations and are used mostly for hiring or 
promotion purposes. 

The teacher sees achievement test results often and scholastic 
aptitude tests less frequently. The interpretation of vocational 
scores is not something which usually falls into the classroom 
teacher's domain. Still, a very important point must be made about 
the interpretation of the scores from vocational batteries. The 
point applies wherever the score is based on a paper-and-pencil test, 
and not some performance measure. Think about this story: 

Ms. Anderson prepared and presented a three-week unit on the 
topic of Crimes and Justice to the 120 students in her four civics 
classes. She prepared a twenty-point short-essay exam over the
unit. Scoring the tests took a substantial amount of time, as you 
might expect. The salesman from XYZ test company happened by 
the next week. He showed her a 75-item multiple-choice test his 
company sold. The test covered the topic of Youth and the Law, 
which was her next scheduled topic. Scoring the multiple-choice 
test would be a very easy task. 

Ms. Anderson was able to purchase the test. She administered it 
to the same 120 students. Upon scoring the test, she found that the 
ranking of the 120 students under both measures was precisely 
the same. The salesman was jubilant. "Why give those essay tests, 
with the tedious reading task, when our test gives the same re- 
sults?" 

Why indeed? 

Think again of the chart in chapter 4 and especially of the 
column describing different testing purposes. Why did Ms. Ander- 
son administer the tests? To rank the students? Possibly, but more 
likely to see how well they had mastered the unit. Most tests 
should be administered to determine individual mastery of objec- 
tives and not to end up with a ranking system. When Ms. Anderson 
read the essay tests she got a pretty good picture of how well each 
student had mastered the concepts in the Crimes and Justice unit. 
Knowing that the ranking was the same with the second test does 
not necessarily indicate the level of mastery of the students in the 
class. It only indicates that the same student scored highest on 
both tests, another student was second on both, and so forth. That 
top student might have scored only 30 percent correct on the 
second test and still be at the top. 

Of course, Ms. Anderson would have known individual perform- 
ance levels by looking at the absolute results. The point of the story 
is that a high level of agreement between test A and test B does not 
necessarily mean that test B is as good as test A at every task. This 
is especially true for vocational tests. 
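
The salesman's argument can be checked with a small sketch. The five students and their scores are invented for illustration (the story gives no actual scores), and the rank correlation used is the ordinary Spearman formula for untied ranks.

```python
def ranks(scores):
    """Rank 1 = highest score. Assumes no tied scores."""
    order = sorted(scores, reverse=True)
    return [order.index(s) + 1 for s in scores]

def spearman_rho(x, y):
    """Spearman rank correlation for two lists without ties."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical scores for the same five students on the two tests
essay_scores = [18, 15, 12, 9, 6]   # out of 20 points
mc_scores = [23, 20, 17, 14, 11]    # out of 75 items

essay_percent = [100 * s / 20 for s in essay_scores]
mc_percent = [100 * s / 75 for s in mc_scores]

print(spearman_rho(essay_scores, mc_scores))  # 1.0
print(max(essay_percent), round(max(mc_percent), 1))
```

The rankings agree perfectly, yet in this made-up case the top student answered fewer than a third of the multiple-choice items correctly. Agreement in rank order says nothing about how much of the unit anyone mastered.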

The whole discussion which surrounded performance measures 
in chapter 5 should be brought back into focus. If the actual vocation
requires performing some act (typing, filing, loading equipment,
making dies, guiding airplanes to a runway, or fixing vehicles),
then the measure should really be a performance measure.
The results from paper-and-pencil measures might correlate
highly with performance on each task, but the decision maker needs to know, "Can the
person do the job?" not, "How well does the person compare to 
others who have taken the test?" 




Single-Purpose Tests 

Do not come to the mistaken impression that achievement batteries 
represent the only standardized testing approach in the schools. 
Well-established, single-purpose, published and standardized tests 
are available in all of the common school subject matter areas. 
These measures are more common at the secondary school level in 
areas like physics, chemistry, all different kinds of math, economics, 
social studies, and various components of English; but single-purpose
tests have been published in most pre-high school areas as
well. One need only check the index of the Mental Measurements 
Yearbooks for a list of examples. 

As mentioned earlier in the discussion of achievement batteries,
the single-purpose tests have the disadvantage of having each test 
normed on a somewhat different population. Advantages exist as 
well, however. A test with the single task of measuring algebraic 
achievement will be longer and more complete than a subtest on 
the same topic from a battery. Because more time is concentrated 
on a single topic, these tests are likely to be more diagnostic. 



Measures of Vocational Interest 

Published and standardized measures of vocational interest will 
represent the first entry under the heading of typical performance 
measures. Typical performance measures, as noted earlier, are de- 
signed primarily for the affective domain — interest, attitude, per- 
sonality, values, and motives. For a given individual, all of these 
terms are clearly interrelated. In chapter 11 some thoughts on 
designing measures for the affective domain will be covered. Al- 
though a variety of vocational interest schedules are identified in 
the Yearbooks, the Strong Vocational Interest Blank (SVIB) 
and the Kuder Preference Records dominate the field. Both are 
frequently used with secondary school students and are well-estab- 
lished measures with extensive sets of norms and a considerable 
amount of empirical research behind them. Each is designed to 
provide the student with a profile of vocational interests. 

The SVIB consists of about 400 items in eight different parts. 
Stimuli from the headings "Occupations," "School Subjects," 
"Amusements," "Activities," and "Peculiarities of People" are given
to which the respondent marks "Like," "Indifferent," or "Dislike." 
Examples of the SVIB-type of items are given below: 




43. Rum runner              L  I  D

44. Clean the toilet bowl   L  I  D

45. Hippies                 L  I  D

Obviously the examples are artificial and not taken from the SVIB, 
but the form is accurate. In the last three sections the respondent 
is required to choose three activities from among ten which he 
would most enjoy and three others which would be least enjoyable. 
The norming of the SVIB is interesting because it exemplifies 
something called criterion keying. The keying is completely em- 
pirical and not theoretical. Suppose the category is "Architect." 
The task is to find out how closely your interests are attuned to 
those in the architectural profession. Here's what the people who 
construct the SVIB do to determine the norms for the "Architect" 
scale: 

1. They administer the entire SVIB to a bunch of practicing 
architects. Then, a random group of "men-at-large" and "women- 
at-large" are chosen who also complete the SVIB. 

2. Items which sharply discriminate between the architects and
the others are included in the norms. For example, if almost all of
the architects circle (L) after "Rum Runner" while others (the
"men-" or "women-at-large") generally circle (I) or (D), this item
discriminates. It will be included in the scale. If you circle
the (L) you will receive a point or two on the architect scale;
otherwise, you will receive none. If almost all of the architects
circle (D) after "Hippies" and the others generally circle (I) or
(L), this item will also be included. However, now you receive a
point or two for circling (D) and none for circling the others.

3. Even if all of the architects agree on an item (e.g. circling
(D) on the "Clean the Toilet Bowl" item), the item will not be
included if it does not discriminate. That is, if the other men and
women also circle (D), the item does nothing to differentiate the
groups and will not be included in the architect scale.
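
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SVIB publisher's actual procedure: the responses are made up, the cutoff `min_gap` is a hypothetical value chosen for the example, and each keyed item is worth exactly one point (the text says "a point or two").

```python
from collections import Counter


def key_items(criterion, reference, min_gap=0.5):
    """Return {item: keyed_response} for items that separate the two groups.

    criterion, reference: lists of dicts mapping item name -> "L", "I", or "D".
    min_gap is a hypothetical cutoff on the difference in proportions.
    """
    scale = {}
    for item in criterion[0]:
        crit = Counter(person[item] for person in criterion)
        ref = Counter(person[item] for person in reference)
        # The criterion group's most common response is the candidate key.
        response, count = crit.most_common(1)[0]
        gap = count / len(criterion) - ref[response] / len(reference)
        if gap >= min_gap:
            scale[item] = response
    return scale


def score(person, scale):
    """One point for each keyed response the person gives (an assumption)."""
    return sum(1 for item, keyed in scale.items() if person[item] == keyed)


# Made-up responses, for illustration only.
architects = [
    {"Rum runner": "L", "Clean the toilet bowl": "D", "Hippies": "D"},
    {"Rum runner": "L", "Clean the toilet bowl": "D", "Hippies": "D"},
    {"Rum runner": "L", "Clean the toilet bowl": "D", "Hippies": "I"},
    {"Rum runner": "L", "Clean the toilet bowl": "D", "Hippies": "D"},
]
people_at_large = [
    {"Rum runner": "D", "Clean the toilet bowl": "D", "Hippies": "L"},
    {"Rum runner": "I", "Clean the toilet bowl": "D", "Hippies": "I"},
    {"Rum runner": "D", "Clean the toilet bowl": "D", "Hippies": "L"},
    {"Rum runner": "L", "Clean the toilet bowl": "D", "Hippies": "I"},
]

architect_scale = key_items(architects, people_at_large)
you = {"Rum runner": "L", "Clean the toilet bowl": "L", "Hippies": "D"}
your_score = score(you, architect_scale)
```

With these invented responses the toilet bowl item drops out, because everyone dislikes it, while the other two items are keyed (L) and (D) respectively.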

The report which the student and his counselor receive contains 
his agreement scores with some fifty or so vocations. Separate 
norms are prepared for men and women. Longitudinal research has 
indicated that vocational interests, as measured by the SVIB, are 
fairly good predictors of the high school student's eventual voca- 
tional choice. 

The SVIB questions are usually in the free-response mode. A 
person could "Like," be "Indifferent" to, or "Dislike" everything. 



The Kuder type items do not allow this luxury. An example of the 
form (but obviously with phony questions) is given below: 

V. Clean the toilet bowl     ●  V

W. Have a tetanus shot       W  ●

X. Attend a flower show      X  X

In the left hand column the respondent blacks out the most pre- 
ferred choice and in the right hand column the least preferred. This 
respondent most preferred cleaning the toilet bowl and least pre- 
ferred the tetanus shot. 

The profile returned from a Kuder indicates areas of general 
interest, rather than congruence of interest with specific occupa- 
tions. The interest areas are outdoor, mechanical, computation, 
scientific, persuasive, artistic, literary, musical, social service, and 
clerical. The student and counselor infer vocations from interest 
patterns. A student interested in architecture, for example, might 
be expected to score high on the computation, scientific and artistic 
scales. 

The choice of vocation is a critical one. Any help the student can 
receive should be welcome. If the respondent is completely candid 
and makes "typical" responses, the vocational interest measures 
can provide assistance. However, anyone using a vocational interest 
scale should keep in mind these cautionary notes: 

1. The interests can be faked. If you tell a group of high school 
seniors "Respond as if you'd like to be a minister" they can 
probably do so. If the respondent feels it is important to appear 
to be something he really is not, he can do so. 

2. Interest does not necessarily predict aptitude. The flunking 
algebra student with an interest in structural engineering is being 
unrealistic. An interest in piano will never be consummated suc- 
cessfully without finger dexterity. 

3. Interests change as new horizons are opened. Substantial 
numbers of students change majors in college. People past thirty 
have been known to make radical shifts in occupations. Always 
interpret interest inventory results with this introductory state- 
ment: "The way this person looks now . . . ," which implies that 
the person in question might change later on. 

4. You really need to know something about an occupation 
before you can manifest a serious interest in it. Most people know 
what surgeons and radio announcers do, but how much do high 
school seniors know about actuaries or statisticians? If the administration
of one of the interest inventories identified a group of
possible occupations for a student, the next step is to give the
student some information about each of these. Most professional 
organizations (architects, statisticians, teachers, electrical engi- 
neers) have a little booklet describing requirements for and oppor- 
tunities in the field. These should be made available to apparently 
interested students. 

5. In the past, many occupations were dominated by a single 
sex. Engineers were men but elementary school teachers were 
women. Men delivered mail and women said "Number, please." The 
lines have become fuzzy now as each sex moves into the other's
domain. The changes are overdue and are probably irreversible. 
The vocational interest measures will have to react quickly. At the 
present time a woman who receives a score on the "architect" scale 
of the SVIB is probably being compared to very few women archi- 
tects. As more women enter fields previously occupied primarily by 
males (and males enter those previously occupied by females) the 
norms will have to be current. 

Questions 

17. Devise six statements of activities which would probably be enjoyed 
by a person who likes to be outdoors and alone. 

Devise six statements of activities which would probably be enjoyed 
by someone with a high level of music appreciation. 

Devise six statements of activities which would probably be enjoyed 
by someone who is intellectually curious and quite scholarly. 

a. Now, put the 18 statements in a survey in the form of the SVIB. 

b. Next, put the 18 statements in six groups of three each in the 
form of the Kuder. In each triplet, include one statement from 
each of the three categories. 

18. Administer the two measures constructed in 17 to a group of at 
least twenty people. 

a. Do the two forms give approximately the same results for each 
individual? That is, are there individuals whose apparent interest 
changes due to the format of the measure? 

b. To what extent do the 18 items differentiate among the people 
who responded? Did you have some people who scored high in 
each of the three areas? 

c. Interview the person who scored highest in each of the three 
areas. Are the vocational goals of these people in line with the 
measured interests? 



Published Measures of Attitude and Personality Constructs 

"His attitude is so bad, how could he possibly learn?" 
"Her attitude is so good, I'm sure she'll learn now." 
The close connection between the affective domain and cognition 
is recognized by teachers. Learning programs, research projects, 
and curriculum innovations usually have objectives in both do- 
mains. The measurement of attitude — especially change in attitude 
— is important to most evaluation efforts. Given this importance, 
one would think that published, standardized attitude measures 
would abound. Actually, this is not so, and the reason is not 
difficult to ascertain. 

After all, "attitude" usually must end with "attitude toward . . ." 
and the ". . ." will generally be very specific. Before a published 
measure can be profitable it needs a large market. When the "atti- 
tude toward . . ." is directed to something quite specific, the market 
will be small. Thus, the only attitude measures listed in the Year- 
books will be directed toward general things. Attitude toward 
teaching is an example of an available published and standardized 
measure. 

A personality construct is a psychologist's way of shorthanding 
a bunch of frequently vaguely defined behaviors with a word or two. 
For example, think of people who tend to talk a lot, work sales jobs, 
and approach others rather than waiting to be approached. They 
seem enthusiastic in a crowd and prefer company to solitude. This 
general group of descriptions is encompassed by a single word: 
extrovert. Introvert and extrovert are the opposite poles of a 
personality construct. Psychologists define the construct and soon 
standardized measures of the construct appear. Tolerance, adjust- 
ment, and commonality are other examples. 

Anyone who uses a standardized attitude or personality measure 
should understand how these scales are devised. The information 
can be obtained from the publisher in a technical report. As you 
shall see, knowledge of the construction procedures might alter 
your interpretations. 

Suppose, for example, the goal was to construct a measure of 
introversion. One approach is based on content-validity. A series of 
items would be constructed to conform to one definition of intro- 
version. You might write an item like this: 

1. I like to meet people in crowded bars. 

( ) ( ) ( ) ( ) ( ) 

SA A ? D SD 



Here the respondent marks from Strongly Agree to Strongly Dis- 
agree. A mark near the SD end would indicate a tendency toward 
introversion. This, by the way, is called a Likert-type of item. Or 
you might ask this item: 

1. a. Meet a friend at a crowded bar. 

b. Go hiking in the woods. 



Here the respondent chooses the one he prefers. This is a forced 
choice or ipsative measure. In either case, a series of questions 
would be constructed to conform to what is commonly thought of 
as "introversion." In the tryout of the item, internal consistency 
would be sought. That is, if all items are presumably measuring 
the same trait (introversion) then a person with introverted char- 
acteristics should respond to all in the expected introverted direc- 
tion. Extroverted people should respond typically at the other end. 
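
As a rough illustration of how Likert-type marks might be turned into a scale score, consider the sketch below. The 1-to-5 point values, the function name, and the second item (about hiking alone) are assumptions made for the example; they are not taken from any published measure.

```python
# Points for each mark; SD is high because disagreeing with an
# extroverted statement points toward introversion (assumed scoring).
POINTS = {"SA": 1, "A": 2, "?": 3, "D": 4, "SD": 5}


def introversion_score(responses, extroverted_items):
    """Sum item points so that a higher total means more introverted.

    responses: dict mapping item text -> one of SA, A, ?, D, SD.
    extroverted_items: items where agreement is the extroverted direction.
    All other items are reverse-keyed, so agreement earns the high points.
    """
    total = 0
    for item, mark in responses.items():
        points = POINTS[mark]
        if item not in extroverted_items:
            points = 6 - points  # reverse-key: agreement is introverted
        total += points
    return total


extroverted = {"I like to meet people in crowded bars."}
responses = {
    "I like to meet people in crowded bars.": "SD",
    "I like to go hiking alone in the woods.": "A",  # hypothetical item
}
total = introversion_score(responses, extroverted)
```

A respondent who strongly disagrees with the bar item and agrees with the hiking item earns 5 + 4 = 9 of a possible 10 points, which is the expected introverted direction on both items.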

Consider another and very different approach to constructing 
the introversion measure. Why not establish external criterion 
groups and then find items which discriminate one from the other? 
This approach is called the empirical criterion keying method. Here 
the test developer first finds a group of people who are considered 
to be quite introverted and a second group of equal size considered 
to be extroverted. How are the two groups defined? You might ask 
each class to vote on the most introverted or extroverted person 
and assemble the "winners" from each of 100 classes. Or, you might 
use charter members of the Bird Watching Society as your intro- 
verts and an equal number of disc jockeys as the extroverts. Some- 
how or other, though, you must externally establish the criterion 
groups. 

Next, a long list of items is administered to your criterion groups. 
Just as with the norming of the Strong Vocational Interest Blank, 
items which discriminate between the two groups are included in 
the scale. If most of the externally determined introverts answer a 
question one way, while the extroverts respond in the opposite 
direction, the item will be included in the scale. 

The two construction techniques are not all that distinct. Mea- 
sures constructed by the content-validity technique are often 
empirically validated with outside criterion groups, and measures 
devised through empirical keying are checked for internal con- 
sistency. Interpretation of scores from either kind of measure 
requires this hint: 

User's Hint No. 7: If the measure was constructed through the 
content-validity approach, check the items to see if they agree
with your perception of what the construct means. If criterion
keying was the basis for construction, make sure you understand 
how the reference groups were selected to see if that definition of 
the construct agrees with your own. 

Did you agree with the criterion groups suggested for the intro- 
version scale? Perhaps the test developer could have found a few 
hundred solo fishermen or mountain climbers and defined these as 
the "introverted" group. You must know how a term is defined 
before you can make any meaningful interpretation of the score 
from the measure. 

Here are some other notes of caution about interpreting
standardized measures of a personality or attitude construct.

1. Just as interest does not predict aptitude, attitude does not 
necessarily predict behavior. A teacher whose attitude toward be- 
havioral objectives seems very positive may never actually use 
behavioral objectives in the instructional process. Why does ex- 
pressed attitude not always predict behavior? Three reasons seem 
prominent. 

a. Sometimes a person does not have enough information to be 
accurate in the assessment. The teacher who expressed a positive 
attitude toward behavioral objectives may not have, at that mo- 
ment, understood the amount of effort which must go into using 
behavioral objectives in instruction. 

b. Sometimes an honest response is socially unacceptable. The 
respondent gives you what is perceived to be the more acceptable 
answer. Again, if the principal of the teacher's school had adminis- 
tered the attitude questionnaire immediately following an eight- 
week in-service program on behavioral objectives, the teacher might 
have been intimidated into insincere positive feelings. 

c. People do change. An honest attitude expressed at time A 
might have changed by time B when some behavior is manifested. 

2. Personality constructs tend to be situation specific. Is an 
introverted person always introverted? Can you say, "That man is 
honest," and "That man is not," or "He is tolerant," and "She is 
not" with any certainty? The statement has to depend on the situa- 
tion. Does there exist a man or woman who would never steal? 
Never lie? Who is always tolerant? Never tolerant under any con- 
ditions? Is it difficult to conceive of a woman who is quiet but 
cheerful at work (medium introversion), quiet and reserved at 
home (more introversion), but outgoing and boisterous among 
friends (little introversion)? Such a woman would respond with 
three different scores on the same scale, depending upon the
situation she perceived herself as being in when she responded to
the questions. Do not look at the score as static and unchangeable 
even over brief time intervals. 

3. People tend to over-define a construct. That is, the psycholo- 
gist may try to define tolerance with an operational definition. An 
operational definition simply means the psychologist explains how 
the concept has been measured. The layman is liable to attach a 
more personal meaning to the term. These personal meanings are 
often sharply contradictory from person to person. Take the con- 
struct of self-concept. You have some feeling for the meaning of 
the term, so answer these questions: Which of these two middle- 
class teachers would be more successful in a ghetto assignment — 
one with a strong self-concept or one with a weak self-concept? 
Who would make a better actor — a man with a strong self-concept 
or a man with a weak self-concept? People will tend to differ on 
these answers. Much of the difference will be attributed to the 
generally indistinct and often contradictory synonyms which are 
attached to the term. The psychologist's operational definition is 
narrow. The layman will generally go well beyond the operational 
definition and over-define the term. An excellent example of the 
over-definition of a construct is the layman's general interpretation 
of "IQ." 



Question 

19. Think of one example of a case where the expressed attitude of 
college students conflicts with their behavior in the actual situation. 



Item Formats 

Stimulus mode is the way a question is presented. Response mode 
is the manner in which the student expresses an answer. Before 
you ever allow a standardized measure to be administered to your 
students, study the stimulus and response modes on the measure 
with an eye toward answering this question: Will either mode get 
in the way of any of your students showing his or her true per- 
formance level? 

How could a question be presented in a standardized measure? 
Primarily, the mode is printed, involving words or pictures.
Occasionally, the stimulus is given orally, but this is the exception and
not the rule. Movies, video tapes, slide/tape synchronizations, or 
role playing would be very effective stimulus modes, but each 
would be administratively expensive and tedious to the test pub- 
lishers, which probably explains why each is little used. 

The student responds, usually, by making some sort of mark. 
The most common technique is by marking a separate answer 
sheet, although with tests for younger children separate answer 
sheets are discouraged. Young children usually respond by circling 
or x-ing out something directly on the test forms. In an individually 
administered test, the student can respond orally to the tester. 
This, of course, is impossible with group tests. 

How can the stimulus or response mode get in the way of the 
student's showing you what his true performance level is? Consider 
these examples: 

EXAMPLE 1: A question which might well be found in the 
math section of an achievement battery is this one: 

At the banquet 36 tickets were sold to adults. 12 children's 
tickets were sold. The tickets were 30¢ for adults and 10¢ for
children. How much money was collected for the tickets? 

a. $ 1.20 

b. $11.00 

c. $12.00 

d. $14.40 

e. None of the above. 

Assuming the student has had experience with this type of problem, 
how could the stimulus and response mode get in the way of his 
showing you what he knows? How is each different from the usual 
classroom approach to the question? Here are some ways: 

The way the question is asked is probably similar to what the 
student is used to in the classroom. 

Think, though, of differences in the manner of response.

a. The student will have to do his work elsewhere. Usually, in 
the classroom, the work is done right on the test form. 

b. Rather than arrive at an answer and circle it, the student will 
have to arrive at an answer on a piece of scratch paper, look for 
that answer among the possible responses, and mark the proper 
letter for the response on a separate answer sheet. 

c. When the student completes a problem in the classroom, the 
answer is either circled or else the work is rechecked and then
the answer is circled. With the standardized test, an answer which
appears among the possible responses is marked on the answer 
sheet. If the student's answer does not appear, he either assumes 
he has made a mistake, or option "e" (None of the above) is 
correct. A student may not be familiar with this option. The effect 
of this unfamiliarity will be a matter of individual difference and 
some will be bothered by it, while others will not. 

What seems like a routine item in a very familiar format could 
cause some real problems to certain individuals. The problem is 
basically in the response format for example 1. Now consider this 
example of a question which is similar to many found in the 
language section of achievement batteries: 

EXAMPLE 2: . . . (three paragraphs of a letter precede this one 
in the item) 

11 Please call me at a convenient 

12 time like you mentioned I'll be 

13 happy to discuss this further. 

How could line 12 be best changed? 

a. time like you mentioned and I'll be 

b. time as you mentioned. I'll be 

c. time like you mentioned. I'll be 

d. time as you mentioned and I'll be 

How is a student's language usage usually evaluated in the 
classroom? Does the teacher present a paragraph and ask the student
to improve the lines? Usually not. It is more likely that he is
asked to write a paragraph or a paper on some topic. The student 
creates statements which are occasionally wrong and will then be 
corrected by the teacher. Editing the copy of others is not common. 

Even experienced copy editors would be unfamiliar with the 
response format. Copy is edited by crossing out the offending 
passage and writing the correction above. How many students will 
have multiple-choice response experience for copy editing? 

Don't come to the mistaken conclusion that just because your 
students may be unfamiliar with the stimulus and/or response 
modes the standardized measure should be junked. Whenever you 
or anyone else is going to administer a standardized measure to 
students, this series of steps should be adopted: 

First, study the measure to see what the stimulus and response 
modes are. 



Second, practice these modes with the students, though not with the
actual test or even directly in the content area of the test. Keep
practicing until the format becomes second nature to the students. 

Then administer the test. 

That's cheating — teaching for the test, you say? Not so. You 
teach for the test if you present the actual items in the test or very 
similar items, or if you find out the topics stressed by the test and 
arrange your instructional program to conform. The above exhorta- 
tion is simply a reinforcement of the cardinal rule of testing, stated 
earlier, but rephrased here: Don't let the item format get in the 
way of the student showing you what he knows. 



Summary 

Most classroom testing is of the teacher-made variety, but the 
widespread use of standardized, published tests is also a fact of 
classroom life. To use the results wisely with students, the teacher 
should know how these measures are constructed and standardized. 
The standardization, or norming, process allows the teacher to 
make comparisons of each student's performance to a "typical" 
group of the student's peers around the country. 

The published test has been contrasted to the more typical 
teacher-constructed variety of measure, and the primary points of 
difference surfaced. The teacher-made test is more likely to be 
geared to some specific learning task, will not usually have repeated 
usage, and can be written to reflect the idiosyncrasies of a particu- 
lar group of students. The published test is constructed for a more 
general audience, requires much more elaborate planning, and con- 
centrates on comprehensive areas of learning, rather than on 
specific programs. 

Obviously, a published test is not available for every conceivable 
educational learning program. Tests are only published where the 
potential using audience is large enough to justify the publishing 
costs. Suggestions for searching out available sources for published 
tests have been presented in the chapter. 

Confusion frequently arises over the distinction between achieve- 
ment and aptitude tests. This operational definition has been pre- 
sented: If the measure is used to establish a current level of 
performance or mastery on some concept, the measure is an 
achievement measure, but if the measure is used to make some 
decision about the student's potential future performance in the
given area, the measure is an aptitude test. Format, content, or
even the name on the front of the test are not the critical variables 
in making the differentiation. The differentiation depends on the 
use made of the results. 

Published tests are available for three different diagnostic purposes.
Readiness tests are administered to determine if the student has the
necessary background to enter some learning program satisfactorily.
Mastery tests concentrate on determining which particular parts of
the instructional program require additional work. These two types,
readiness and mastery, are
generally used with all students. Another type of diagnostic instru- 
ment, the problem identification test, is frequently available to 
diagnose problems for students who have been entered into a learn- 
ing program, but have not experienced success. These measures 
attempt to diagnose the particular reasons for the non-success. 

Standardized measures are most widely used in the sense of a 
survey of accumulated learning. Batteries of achievement tests are 
widely used in the schools. Survey aptitude measures, to assess the 
potential for learning school-related topics, and vocational aptitude 
batteries are also common. The chapter provides a series of sug- 
gestions for potential users in an effort to help teachers avoid some 
of the more common misuses of results in the classroom. 

Most people are familiar with maximum performance tests. 
Achievement and scholastic aptitude measures fall under this 
heading, which implies that the person being tested will perform 
at the maximum level of his or her ability. Typical performance 
measures attempt to find out from the student what his or her most 
common response is to a given situation. Typical performance 
measures are written to reflect interests, attitudes, values, or per- 
sonality constructs. Some of the more common typical performance 
measures are described. Results from these kinds of instruments 
can be badly misused by classroom teachers, and a whole series of 
hints for users have been presented in the chapter. 

Finally, the teacher is cautioned to check carefully the stimulus 
and response modes in any published measure. The format of the 
measure should not be allowed to interfere with the student being 
able to show what his true performance level is on that measure. If 
the students are unfamiliar with some aspect of the test's format, 
practice sessions are urged. These sessions would, of course, concentrate
on familiarization with format, and not on familiarization
with the actual items of the measure.



8 



Reliability or 
"Could We Find 
Our Way Again?" 



One Christmas my sister predicted that a new bike was in my 
future. She was right. A certain political analyst correctly predicted 
Truman's stunning victory in the 1948 presidential race. An eco- 
nomic seer called a certain economic decline accurately. Each, at 
the time of the actual correct measurement, was viewed with high 
esteem. Later, each fell from grace. Why? Because they couldn't 
repeat the trick later. They were unreliable. 

Reliability is also a testing word. A reliable test is one which 
repeats the trick again and again. In the next chapter, the concept 
of validity will be introduced. Validity refers to measuring the right 
thing. A valid test measures what it is supposed to measure. 
Validity is more important than reliability — who cares if a test is 
precise (reliable) if it is measuring the wrong thing? But both must 
go together. Before a test can be valid, it has to be reliable. 

In this chapter, four techniques for computing reliability will be 
presented along with "how-to-do-it" sorts of examples. After a 
section contrasting the four, some general cautions and limitations 
of reliability coefficients will be covered, including the problem of 
computing reliability coefficients with criterion-referenced tests. 



Techniques for Computing Reliability 

Split-Halves Method 

The split-halves reliability coefficient is the easiest and most practical
one for a classroom teacher to use. The split-halves method is
based on a measure of internal consistency. The presumption is that
a reliable test is one which is internally consistent. Here are scores 
from three pupils on two different tests. The first is internally con- 
sistent. The second is not. 

TEST A (the reliable one according to internal consistency)

Student    Test Items: 1 = correct; 0 = wrong    Odd        Even
Name       1  2  3  4  5  6  7  8  9  10         Numbers    Numbers    Total

Laura      1  1  1  1  1  1  1  1  1  0          5          4          9
Mary Jo    1  0  0  1  1  1  0  1  1  0          3          3          6
John       1  1  1  0  0  0  1  0  1  1          4          2          6

                                                 Avg = 4    Avg = 3    Avg = 7

TEST B (the unreliable one according to internal consistency)

Student    Test Items: 1 = correct; 0 = wrong    Odd        Even
Name       1  2  3  4  5  6  7  8  9  10         Numbers    Numbers    Total

Mildred    1  1  1  1  0  1  0  1  1  1          3          5          8
Milhouse   1  0  1  0  1  1  1  0  1  1          5          2          7
Myron      1  1  0  1  0  1  0  1  0  1          1          5          6

                                                 Avg = 3    Avg = 4    Avg = 7

The first test is internally consistent. For each of the three 
students, the scores on both halves of the test are about the same. 
For each student on Test B considerable variation occurs in the 
half-test scores. This test is not internally consistent. A split-halves
reliability coefficient would show it to be unreliable. 
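
For a teacher who keeps item-by-item records, the odd and even half-test scores can be tallied mechanically. The sketch below is one way to do it; the function name and the choice to store each pupil's marks as a string of 1s and 0s are assumptions made for the example.

```python
def split_halves(items):
    """Return (odd_score, even_score, total) for a string of 1s and 0s.

    Item numbers start at 1, so items[0] is item 1 (an odd-numbered item).
    """
    marks = [int(ch) for ch in items]
    odd = sum(marks[0::2])   # items 1, 3, 5, 7, 9
    even = sum(marks[1::2])  # items 2, 4, 6, 8, 10
    return odd, even, odd + even


test_a = {"Laura": "1111111110", "Mary Jo": "1001110110", "John": "1110001011"}
test_b = {"Mildred": "1111010111", "Milhouse": "1010111011", "Myron": "1101010101"}

for name, items in {**test_a, **test_b}.items():
    print(name, split_halves(items))
```

The printed triples should match the Odd, Even, and Total columns for Test A and Test B.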

Usually, the requirement that a test be internally consistent is 
reasonable. When a teacher writes a test, the test generally covers 
a single concept. If the test scores are going to vary, it should be 
because those taking the test are at variable levels of mastery of 
the concepts covered. If the test items are generally measuring the 
same concepts, and if certain students are generally better pre- 
pared to respond to those concepts, then it follows that those 
students should do well on both halves of the test. The same 
statements can be made substituting "less prepared" for "better 
prepared." If the two half-test scores differ for each student, as they 
do for Test B, it means either (a) the test is imprecise and unre- 
liable; or (b) the test is measuring two or more distinctly different 
concepts which happen to assume odd-even item positions in the 
test. The first reason is more likely. 

It is possible to compute what is called a correlation coefficient 
for reference as a measure of reliability. A lot of different correlation
coefficients are alive and well in this world, but the most alive
one is called the Pearson Product-Moment Coefficient. Because you
will use it a number of times in this and other chapters, some
computational examples of the coefficient will be shown. 

To get a feel for "product-moment," reflect for a moment on the 
common seesaw (or teeter-totter, as Minnesotans call them). If a
fat person wishes to teeter-totter with a skinny waif, some distance 
corrections must be made. The fat person has to sit near the ful- 
crum (the place whereupon the seesaw sits) and the skinny waif 
must sit far away. The product of weight and distance for each 
must be the same. For example, if the waif's weight is 100 pounds 
while the fat person's is 200, then the heavier person should sit 4 
feet from the fulcrum and the other person 8 feet away from the 
fulcrum. The products would be 200 times 4 and 100 times 8, 
which both equal 800. The two would balance each other. 

Enough physics. Back to testing. Laura's scores on Test A were 
5 on the odd numbers and 4 on the evens. Think of the means as 
the fulcrum and the two scores as the fat person and the waif. How 
far is Laura's first score from the fulcrum? 5 is the score and the
average is 4, so 5 - 4 = 1, or one unit. How far is her second score
from the fulcrum (or average)? The score is 4 and the average is
3, so it is 4 - 3 or 1. Her contribution to the product moment part
of the coefficient is the product of those two differences. It is
(5 - 4) × (4 - 3) or 1 × 1 = 1. The three product moment
contributions (by Laura, Mary Jo, and John) are shown below:



Laura       (5 - 4) (4 - 3) = (1) (1)  =  1

Mary Jo     (3 - 4) (3 - 3) = (-1) (0) =  0

John        (4 - 4) (2 - 3) = (0) (-1) =  0

Product-Moment Contribution,              1

Now, before going on, consider what happens with the three 
scores in Test B. 

Mildred     (3 - 3) (5 - 4) = (0) (1)  =  0

Milhouse    (5 - 3) (2 - 4) = (2) (-2) = -4

Myron       (1 - 3) (5 - 4) = (-2) (1) = -2

Product-Moment Contribution,             -6

The first point to tuck away in your brain is that high internal 
consistency leads to high product-moment sums for the students, 
and low internal consistency leads to lower sums. 
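
The sums of product-moment contributions for the two tests can be checked with a short script. This is a minimal sketch; the function name is an assumption, and the scores are the odd and even half-test scores given for Test A and Test B.

```python
def product_moment_sum(odd_scores, even_scores):
    """Sum over pupils of (odd - mean of odds) * (even - mean of evens)."""
    mean_odd = sum(odd_scores) / len(odd_scores)
    mean_even = sum(even_scores) / len(even_scores)
    return sum(
        (odd - mean_odd) * (even - mean_even)
        for odd, even in zip(odd_scores, even_scores)
    )


# Laura, Mary Jo, John
test_a_sum = product_moment_sum([5, 3, 4], [4, 3, 2])
# Mildred, Milhouse, Myron
test_b_sum = product_moment_sum([3, 5, 1], [5, 2, 5])

print(test_a_sum, test_b_sum)
```

The internally consistent test yields the positive sum and the inconsistent one the negative sum.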

Three factors are needed to move from the sum of product-moment
contributions to the famed and fabled Pearson Product-Moment
Correlation Coefficient. One is the number of pairs of scores,
labelled N. Here, N = 3. The other two bits of information are the
standard deviations of each list of scores. "Deviation" here means 
something other than acts reflecting questionable sexual practice. 
The standard deviation of a list of scores refers to the variability 
of the scores on the list. It is the square root of average squared 
deviation of the scores from the mean. You'll understand if you 
look at the example again. The mean for the odd-numbered items 
was 4, you recall. The standard deviation is then: 

                                        Squared
           Score   Mean   Difference   Difference

Laura        5       4         1            1
Mary Jo      3       4        -1            1
John         4       4         0            0

Sum of Squared Differences                  2
Average Squared Difference            2/3 = 0.67
   (divide sum by N)
Square Root of Average                      0.82*
   Squared Difference

Thus, the standard deviation of the three scores on the odd num- 
bers is 0.82. The standard deviation for the three scores on the even 
part is computed below: 

           Score   Mean   Difference   Squared Difference

Laura        4       3         1               1

Mary Jo      3       3         0               0

John         2       3        -1               1

Sum of Squared Differences                     2
Average Squared Difference                     2/3 = 0.67
Square Root of Average Squared Difference      0.82
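
The same table can be worked by a short Python function. This is a minimal sketch; the name `standard_deviation` is an invented one, and it divides by N exactly as the tables above do.

```python
import math

def standard_deviation(scores):
    """Square root of the average squared deviation from the mean (divides by N)."""
    mean = sum(scores) / len(scores)
    average_squared = sum((s - mean) ** 2 for s in scores) / len(scores)
    return math.sqrt(average_squared)

sd_odds = standard_deviation([5, 3, 4])   # Laura, Mary Jo, John on the odd items
sd_evens = standard_deviation([4, 3, 2])  # the same students on the even items
print(round(sd_odds, 2), round(sd_evens, 2))  # 0.82 0.82
```

Python's standard library function `statistics.pstdev` also divides by N and should give the same values.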

Now, the Pearson Product-Moment Correlation Coefficient be- 
tween the two lists of three scores (symbolized by r) is given: 

$$ r = \frac{\text{sum of product-moment contributions}}{N \times (\text{standard deviation of odd scores}) \times (\text{standard deviation of even scores})} $$

or 

$$ r = \frac{1}{(3) \times (0.82) \times (0.82)} = 0.50 $$

* You don't have to dig out your old square root computing skills. A 
table is provided in the back of the book which will supply all the information 
you should need. 



Question 

1. Before going on, compute the Pearson Product-Moment Coefficient 
between the two lists of three scores for Test B. The answer will be 
-0.87. 
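
If you want to check your hand computation, the whole coefficient can be sketched in Python as follows. The function name `pearson_r` is chosen here for illustration, and the standard deviations divide by N, as in the worked example.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation using N-divisor standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    contributions = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return contributions / (n * sd_x * sd_y)

print(round(pearson_r([5, 3, 4], [4, 3, 2]), 2))  # Test A: 0.5
print(round(pearson_r([3, 5, 1], [5, 2, 5]), 2))  # Test B: -0.87
```

Note that the sketch will fail with a division by zero if every student earns the same score on one of the halves, since that standard deviation is zero.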



Reviewing thus far, you have seen that 

a. Test A was internally consistent because the high scorer on 
the odds was also high scorer on the evens, and vice versa; 

b. The more internally consistent a test is, the higher will be 
the sum of product-moment contributions; and 

c. The actual coefficient is found by weighting the product- 
moment contributions by three factors: number of pairs, 
standard deviation of odd-item scores, and standard deviation 
of even-item scores. Add to these observations 

d. The higher the correlation coefficient, the more internally 
consistent is the test, and vice versa. 

Test A's coefficient was 0.50 while Test B's was much lower at 
-0.87. 

One further computation must be made before the split-halves 
reliability is known. A test contains a certain number of items, and 
usually the items are a sample from some larger domain of content. 
When one samples, all other things being equal, the larger the 
sample the more reliable it is. Using the coefficient 0.50 for Test A 
is unfair to the test. It is really ten items long, but the coefficient 
is based on only five items. How much higher would the reliability 
have been if it had been based on the full ten items? 

A formula developed by Spearman and Brown called (surpris- 
ingly!) the Spearman-Brown Prophecy Formula can be used to 
give the test credit for its real length. The formula reads: 



$$ r_{11} = \frac{2\, r_{\frac{1}{2}\frac{1}{2}}}{1 + r_{\frac{1}{2}\frac{1}{2}}} $$

This is common symbolism of the type you'll see in publishers' 
manuals and things like that, so you might as well know what it's 
all about. The $r_{\frac{1}{2}\frac{1}{2}}$ is the correlation coefficient you computed for 
the two half tests. The $r_{11}$ then is the predicted or prophesized 
coefficient had the computation been based on the whole test (ten 
items) rather than just half. For Test A, the predicted split-halves 
reliability coefficient is 

$$ r_{11} = \frac{(2)(0.50)}{1 + 0.50} = 0.67 $$
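
The prophecy step doubles the half-test correlation and divides by one plus that correlation. A minimal Python sketch, with the invented function name `spearman_brown_whole_test`, looks like this:

```python
def spearman_brown_whole_test(r_half):
    """Predicted whole-test reliability from the correlation between two halves."""
    return (2 * r_half) / (1 + r_half)

print(round(spearman_brown_whole_test(0.50), 2))  # Test A: 0.67
```

Feeding in a perfect half-test correlation of 1.00 returns exactly 1.00, which is a useful sanity check on the formula.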



Questions 

2. A 12-item test has a correlation coefficient between the odd and even 
halves of 0.40. What is the prophesized split-halves reliability 
coefficient for the whole test? 

3. Take Question 2 one step further. What would be the prophesized 
reliability had the test been 24 items long? 

4. Now try 48 items and 96 items. Do you think the coefficient will ever 
go over 1.00? 

5. Five students take a 12-item test. Here are the results. 

Student                    ITEM

          1  2  3  4  5  6  7  8  9  10  11  12    Odds  Evens  Total

Nancy     0  1  1  1  1  1  1  1  1   1   1   1

Jan       1  0  1  0  1  1  1  1  1   1   1   1

Peter     1  1  1  0  1  1  0  0  0

John      1  1  0  0  1  1  0  0  1   1   1   1

Tricia    0  0  1  1  0  1  0  1  1   1

a. Fill in the "odds," "evens," and "total" entries for each student. 

b. Find the average for each half-test. 

c. Find the standard deviation of each half-test score. 

d. Find the sum of the product-moment contributions. 

e. Find the correlation coefficient for the half-test. 

f. Find the predicted split-halves coefficient for the whole test. 



Before moving on to the rest of the chapter, these points need to be 
reviewed: 

1. You now have at your disposal a technique for computing the 
most practical reliability coefficient for a classroom test. Three 
other techniques will be described. These last three will often be 
encountered when you read about tests written by others. If you 
decide to use a split-halves coefficient on some classroom test, follow 
these steps: 

a. Set up your score sheet showing the half-test (odd-even) 
scores for each individual. 

b. Find the mean of the list of scores on the odd items and for 
the even items. 

c. Find the product-moment contributions. 

d. Find the two standard deviations. 

e. Find the basic coefficient (r); then find the predicted or 
prophesized coefficient. 
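
Steps a through e can be strung together in one short Python sketch. Everything named here (`pearson_r`, `split_halves_reliability`, `responses`) is invented for illustration, and the item-by-item answers are hypothetical: they were made up so that the half-test scores match those of Laura, Mary Jo, and John on Test A.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation using N-divisor standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    contributions = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return contributions / (n * sd_x * sd_y)

def split_halves_reliability(item_matrix):
    """item_matrix: one row per student, one 0/1 entry per item (item 1 first)."""
    odds = [sum(row[0::2]) for row in item_matrix]   # items 1, 3, 5, ...
    evens = [sum(row[1::2]) for row in item_matrix]  # items 2, 4, 6, ...
    r_half = pearson_r(odds, evens)
    r_whole = (2 * r_half) / (1 + r_half)            # Spearman-Brown step
    return odds, evens, r_half, r_whole

# Hypothetical item patterns, invented so the half scores match Test A.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # Laura: odds 5, evens 4
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],  # Mary Jo: odds 3, evens 3
    [1, 1, 1, 1, 1, 0, 1, 0, 0, 0],  # John: odds 4, evens 2
]
odds, evens, r_half, r_whole = split_halves_reliability(responses)
print(odds, evens, round(r_half, 2), round(r_whole, 2))
# [5, 3, 4] [4, 3, 2] 0.5 0.67
```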

2. The computations heretofore have been pretty simple because 
the numbers were small. The means and differences were whole 
numbers. In practice, things don't usually turn out so nicely. You 
could compute reliabilities, standard deviations, and correlations 
using only the information given in this chapter. If you want a 
more detailed explanation of these statistics, including some other 
computational formulae, consult any elementary statistics text.¹ 

1. Most books will use fairly standard symbolism. The formula for the 
standard deviation of the odds and evens would be given by 

$$ s_o = \sqrt{\frac{\sum (X_o - \bar{X}_o)^2}{N}} \qquad s_e = \sqrt{\frac{\sum (X_e - \bar{X}_e)^2}{N}} $$

where 

$s_o$, $s_e$ = standard deviation of odds and evens, respectively. 
In general, $s$ stands for standard deviation, and a 
subscript is read "standard deviation of ____" 
where the blank is filled by the subscript. 

$\sum$ is read "sum of the . . ." 

$X_o$, $X_e$ stand for the score for one student on the odd and 
even numbers respectively. 

$\bar{X}_o$, $\bar{X}_e$ are the means of the scores on the odd and even 
numbers respectively. 

$\sum (X_o - \bar{X}_o)^2$ means the "sum of all the squared differences." 
You did these as part of the example. 

$$ r = \frac{\sum (X_o - \bar{X}_o)(X_e - \bar{X}_e)}{N\, s_o\, s_e} $$

where $\sum (X_o - \bar{X}_o)(X_e - \bar{X}_e)$ is the "product-moment contribution" for 
all the people in the class. 




Kuder-Richardson Method 

A second method (or actually group of methods) also is based on a 
measure of internal consistency. Only one testing is required — a 
practical necessity for the average classroom teacher. The split- 
halves method was based on the assumption that the odd and even 
items make up two fairly equivalent tests. The Kuder-Richardson 
technique assumes that each item is highly correlated with every 
other item. That is, a well-prepared student will have about the 
best chance of scoring each of the items correct; a student with 
average preparation about average probability; and the least pre- 
pared will have the lowest probability on each item. 
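
This chapter does not print a Kuder-Richardson formula. One widely cited member of the family is usually called KR-20, and the sketch below shows it only as an illustration of the idea, not as the formula any particular publisher uses. The function name `kr20` and the data are invented, and the variance of the total scores is divided by N to stay consistent with the standard deviations computed earlier.

```python
def kr20(item_matrix):
    """KR-20 for 0/1 item scores; rows are students, columns are items."""
    n_students = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n_students
    # Variance of total scores, dividing by N (an assumption made here).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_matrix) / n_students
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

# Invented data: every item sorts the students the same way.
consistent = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(round(kr20(consistent), 2))  # 1.0
```

When each item ranks the students identically, as in the invented data, the coefficient reaches its ceiling. Items that disagree with one another pull it down.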

What would make these between-item correlations tend to be 
lower? One contributing factor would be the non-homogeneity of 
the items. Usually, a test is designed to measure a single (albeit 
general) concept. Suppose a test included 10 spelling items, 10 
items measuring ability to visualize in space, and 10 algebra prob- 
lems. The best spellers may not be the best visualizers or algebra- 
ists. These good spellers will not have the best chance on any of the 
non-spelling items. This test could never be internally consistent. 

But another cause of low internal consistency can be poor item 
construction — assuming the test has been designed to measure a 
single domain. Sometimes an item is a bit ambiguous. The ambigu- 
ity, however, is only apparent to the well-prepared student. The 
poorer ones blast right through to the correct answer while the 
better ones ponder and think and sometimes choose the wrong 
answer. Such items are called negative discriminators — the low 
performers overall get a certain item right more frequently than the 
overall high performers. This shouldn't happen if the test is mea- 
suring a single concept. If a test has been written about a single 
general concept (like consumer law or weights and measure conver- 
sions) the internal consistency coefficients will tell you how much 
ambiguity has been avoided in the item construction. 

How high should these reliabilities be? As will be seen shortly, 
reliability depends on some other factors which must be considered 
in answering the question. But all other things being equal, a 
published test ought to have a reliability in excess of 0.85. A first 
try by a teacher could be expected to exceed 0.70. Since a test's 
validity is limited by its reliability (a test cannot measure the right 
thing if it cannot measure anything with precision), it should have 
reliability in excess of 0.40 or 0.50 to be of any value. 

One final note: The classroom teacher probably would use a 
split-halves technique for hand computations. Test publishers with 
computer programs are more likely to go to the Kuder-Richardson 
forms. How are they related? It can be shown that, in general, the 
split-halves coefficient will be a little higher than the Kuder-Richardson 
coefficient. The difference will be around one to four percent. 
Thus, if the Kuder-Richardson coefficient for a test is 0.80, the 
split-halves coefficient for the same test will likely be 0.81 to 0.84. 

Test-Retest and Alternate Forms Methods 

To determine test-retest reliability one does just what the name 
implies. You sit a group of students down and administer a test; 
then, sometime later, you repeat the process — same test, same 
students. The coefficient is found by computing the Pearson 
Product-Moment Correlation Coefficient between the two lists of 
scores. 



Question 

6. Just to get a feel for the test-retest situation, here are the scores from 
four (4) students when the same test was administered to them on 
Monday and on Wednesday. 

STUDENT    SCORE MONDAY    SCORE WEDNESDAY

Tom             19               20

Dick            21               21

Harry           22               22

Hubert          18               21

a. Find the average score for Monday and Wednesday. 

b. Find the product-moment contributions for the four students. 

c. Find the standard deviations for the two lists of scores. Call them 
s_M and s_W for "standard deviation of Monday scores" and "stan- 
dard deviation of Wednesday scores." 

d. Find the Pearson Product-Moment Correlation Coefficient. (Now, 
since the whole test is used to find the coefficient, there is no need 
to use the prophecy formula.) 



A problem with the test-retest approach is that a student will 
tend to remember the items from the first administration to the 
second. Tom may have spent 90 seconds figuring out an item on 
Monday. On Wednesday, when he sees it again, he will probably 
remember the earlier reasoning. 




To avoid the "remembering" problem, alternate or parallel forms 
of a test are often constructed. Each form attempts to sample the 
same domain of content or skills, but in slightly different manners. 
For example, look at this item: 

10. Tom had three times as many marbles as Laura. Tom had 12 
marbles. How many marbles did Laura have? 

The task requires converting "three times as many" for Tom to 
"one-third as many" for Laura; then taking 1/3 of 12 to get 4. An 
alternate or parallel item might be: 

10. Ellen worked four times as long as Clint. Ellen worked 16 
hours. How long did Clint work? 

Notice that the computational task for each is about the same. The 
most likely incorrect answers (3 X 12 = 36 for the first form and 
4 X 16 = 64 for the second) are about equally difficult computa- 
tions. 

Most publishers of standardized achievement test batteries have 
two, and often three, alternate forms. The reason is not simply 
because students will remember. After all, the testing is at one-year 
intervals, so remembering cannot be much of a factor. Besides, as 
the student moves to the next grade, he will take a higher level of 
the test. The reason for alternate forms is to foil the teacher more 
than anyone. If the same battery is administered year after year, 
the teacher will have a strong tendency to "teach for the test" — not 
a good thing unless the test is criterion-referenced and covering all 
important objectives. Of course, if the teacher was bent on "teach- 
ing for the test" (heaven forbid any present readers would do such 
a thing!) he or she could simply copy down the concepts covered. 
The same concepts will be found in the form used in the following 
year even if the actual items are different. 

To compute an alternate forms reliability coefficient, one simply 
administers the two forms to the same group of students. This gives 
two scores for each student. Then, a Pearson Product-Moment 
Correlation Coefficient is run between the two lists — just as you did 
for the two lists of half-test scores or for the two lists in the 
previous exercise. 

The internal consistency measures (split-halves and Kuder- 
Richardson) required a single test administration. The test-retest 
and alternate forms techniques require that the same group of 
students be tested twice. As you might expect, the interval between 
testings is important. What will be the effects of a short interval, 
say after an hour or even a day? The test-retest reliability will be 
quite high, since many will remember their correct answers. If the 
alternate forms are well-constructed in a parallel manner, high 
correlation should result. As time passes, however, the students 
change, and the change is selective. Some learn the content covered 
by the test, others do not. As time goes by, the two reliability 
coefficients drop. 

The moral to the story: When a test publisher gives a test-retest 
or alternate form coefficient, demand that the testing interval also 
be given. Alternate form reliabilities should be high when the re- 
testing interval is short. Coefficients around 0.90 should be expected 
for published tests. 

The Use of a Test-Retest Model to Assess Change Over Time. 
Alternate forms are nice for test publishers but most people who 
test have trouble enough coming up with one decent test. So very 
often a learning program of some kind requires a pre-test (before 
the action begins) and a post-test (to see if anything happened). 
The easiest approach is a test-retest one, but two problems develop: 

a. As mentioned earlier, people remember responses from the 
first testing. The two tests are thus different, since one demanded 
solutions to new problems, and the other only the recall of the 
previous solutions. If the time between testings is substantial (say 
a month) this isn't really a very serious problem. 

b. There is also a problem of test sensitization. An example will 
explain the term most easily. 

Suppose the topic is Texas history. The student is a foreigner — 
meaning not a Texas native, but a recent import. He knows nothing 
of the Texas saga. A question in the pre-test mentions the exploits 
of Big Jim Lane, but does not mention Giant Bill Sullivan. Even if 
Big Jim and Giant Bill are of equal stature in the eyes of the 
Texans, our "foreigner" will be sensitized to Big Jim by the pre- 
test. He will pay special attention whenever Big Jim is mentioned 
in the instructional program; however, mention of Giant Bill will 
only elicit the usual amount of attention. 

Remember that a test is usually just a sample from a larger 
domain of content. If the students become sensitized to the sample, 
they will learn that particular content at a higher level than they 
will learn the rest of the domain. The growth (expressed as change 
scores from pre- to post-testing) will be spuriously high. 




One way to handle this problem is to have only half of the group 
receive the pre-test, while the entire group takes the post-test. It is 
then statistically possible to ascertain the degree to which the pre- 
tested group was helped by their sensitization. The idea is good in 
theory, but has a very real practical disadvantage. What does the 
poor teacher do with those who are not tested at pre-test time? If 
one group in a room is being tested and a second is not, some 
interference is built up to the test. Which group would you prefer 
— the tested or not tested? Wouldn't you be a little "testy" if you 
got ticketed for the testing group? 

One solution is to give the non-tested students some other 
activity. Perhaps there is a different test. You might have them 
write some sort of theme — anything so they are not just sitting 
and enjoying themselves while a group of classmates is being 
tested. Of course, you might remove the "chosen group" from the 
room, but that's usually complicated by trying to find an empty 
room somewhere. 

Rather than just make up some task for the other group, all of 
the students might be pre-tested at the same time, but only with 
half of the test. Suppose you have a 50-item post-test written. You 
could make two pre-tests from this test: one 25-item test consisting 
of the odd-numbered items (given to half of the students), and 
one 25-item test consisting of the even-numbered items (given to 
the other half). The complete 50-item test will be administered 
to all of the students as a post-test. Now, half of the students 
will be sensitized to the odd-numbered items, and the other half 
to the even-numbered items. You will be able to see how well each 
group does on the items to which they had been sensitized, as well 
as the other items. All students will be tested the same amount and 
at the same times. 
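
One way to look at the results of this split pre-test design is sketched below in Python. It is only an illustration: the function name `sensitization_summary`, the six-item post-test, and the four students' answers are all invented, and comparing simple averages is an assumption about how one might summarize the data rather than a prescribed analysis.

```python
def sensitization_summary(post_matrix, pretest_half):
    """
    post_matrix: one row of 0/1 post-test item scores per student.
    pretest_half: "odd" or "even" for each student, naming the half
                  of the test that student saw as a pre-test.
    Returns the average proportion correct on pre-tested items and
    on items that were not pre-tested.
    """
    seen, unseen = [], []
    for row, half in zip(post_matrix, pretest_half):
        odd_items = row[0::2]   # items 1, 3, 5, ...
        even_items = row[1::2]  # items 2, 4, 6, ...
        if half == "odd":
            seen_items, unseen_items = odd_items, even_items
        else:
            seen_items, unseen_items = even_items, odd_items
        seen.append(sum(seen_items) / len(seen_items))
        unseen.append(sum(unseen_items) / len(unseen_items))
    return sum(seen) / len(seen), sum(unseen) / len(unseen)

# Invented data: four students, a six-item post-test.
post = [
    [1, 0, 1, 0, 1, 1],  # pre-tested on the odd items
    [1, 1, 1, 0, 1, 0],  # pre-tested on the odd items
    [0, 1, 0, 1, 1, 1],  # pre-tested on the even items
    [1, 1, 0, 1, 0, 1],  # pre-tested on the even items
]
halves = ["odd", "odd", "even", "even"]
seen_mean, unseen_mean = sensitization_summary(post, halves)
print(seen_mean, unseen_mean)
```

In this made-up class every student does far better on the items seen at pre-test time than on the others, which is the pattern sensitization would produce.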



Comparing the Techniques 

Test publishers will report one or more of the four types of reliabil- 
ity coefficients introduced in this chapter. By way of comparing, 
consider these parameters: 

1. Number of forms required. Only the alternate forms technique 
requires more than one form. 

2. Number of testing sessions. Alternate forms and test-retest 
coefficients require two testing sessions with the same students. 
Split-halves and Kuder-Richardson procedures can be handled with 
a single administration. 

3. Errors due to time. As the time between administrations 
increases for the test-retest and alternate form techniques, the 
coefficients will tend to diminish. With the other procedures 
(split-halves and Kuder-Richardson) there is no time interval since 
the coefficient is based on a single testing. 

4. Errors due to non-homogeneity of items. The Kuder-Richard- 
son coefficient is very sensitive to non-homogeneity. The other 
internal consistency technique (split-halves) is also sensitive to 
non-homogeneity but to a lesser degree. With two administrations 
of a test to the same group (as with test-retest and alternate forms) 
the homogeneity question is not so important. 



Cautionary Notes on Interpreting Reliability Coefficients 

A key concept to remember about reliability is that each of the 
different kinds of coefficients is based on a list of scores from some 
real, live students. The size of the coefficient will depend on who is 
in the group. A test does not have a reliability coefficient. A test 
has a different coefficient for each different use. The statement, 
"the XYZ Achievement Test has a reliability coefficient of 0.85," is 
meaningless. The statement should describe the group tested and 
indicate which of the various reliability coefficients has been com- 
puted. 

The reason for demanding a clear statement of the type of 
reliability coefficient has been discussed earlier. Generally speaking, 
an internal consistency coefficient will be higher (for a given test) 
than will a coefficient which is computed from two testings with an 
interval between them. And if a test-retest or alternate forms 
coefficient is given, the length of the interval between testings 
should be given. 

Why is it important that the group with whom the test was 
used also be described clearly? The reason is that the more spread 
out the scores in the sample, the higher the coefficient will be. If a 
fifth-grade achievement test is used only with gifted students in 
fifth grade, the spread of the scores will be small, since the group 
is fairly homogeneous. The range would be much more if the sample 
were drawn from all normal fifth grades, but even larger if the 
sample included fourth and sixth graders as well. For the same test, 
the coefficient computed for the gifted students would be smallest; 
the coefficient based on a sample of fifth graders next largest; 
and the coefficient based on a sample of fourth, fifth, and sixth 
graders the largest of the three. 



Question 

7. Five students take each of two tests twice with a three-week interval 
between testings. The results are as follows: 





                 TEST A                  TEST B

            First     Second        First     Second
            Testing   Testing       Testing   Testing

Laura          5         6             5         6
Carolyn        4         6             4         5
Mary Jo        4         5             3         4
Jerry          4         4             2         3
John           3         4             1         2



Compute the test-retest reliability coefficient for each test. 



In Test A, the range of scores in each testing is two points, 
whereas in Test B the range is four points. The ranks of the stu- 
dents remain constant. Although a few ties occur in Test A, no 
position change on the list occurs at any of the testings. If you have 
computed the reliability coefficients, however, you have seen the 
coefficient for Test B is 1.00, but the one for Test A is less than 
1.00. This occurs even though the rankings are the same. The range 
of the scores does affect the size of the reliability coefficient. 
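
The two coefficients in Question 7 can be checked with the same kind of Python sketch used for the half-test scores. The function name `pearson_r` is an invented one; the score lists are the first and second testings for Test A and Test B.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation using N-divisor standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    contributions = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return contributions / (n * sd_x * sd_y)

test_a = pearson_r([5, 4, 4, 4, 3], [6, 6, 5, 4, 4])  # range of two points
test_b = pearson_r([5, 4, 3, 2, 1], [6, 5, 4, 3, 2])  # range of four points
print(round(test_a, 2), round(test_b, 2))  # 0.71 1.0
```

Carried to more places, the Test A value is about 0.707. The narrower range and the ties keep it well short of 1.00 even though no student changes position.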

The length of the test is another factor affecting the size of the 
reliability coefficient. All other things being equal, the longer the 
test, the higher the reliability coefficient. The Spearman-Brown 
Prophecy formula is a statement of the relationship between length 
and size of coefficient. Thus, if you see two similar tests but of 
different lengths (e.g. Test A of 30 items and Test B of 20 items), 
where the two have the same reliability, Test B is a better choice. 
If it had been 10 items longer, the reliability probably would have 
exceeded that of Test A. 
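
The comparison of the 20-item and 30-item tests can be made concrete with the general form of the Spearman-Brown formula, which predicts reliability when a test is lengthened by any factor, not just doubled. The sketch below is an illustration only: the function name `spearman_brown` is invented, and the shared reliability of 0.80 is a hypothetical figure, since no value is given for the two tests.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical: Test A (30 items) and Test B (20 items) both report 0.80.
r_b_at_30_items = spearman_brown(0.80, 30 / 20)
print(round(r_b_at_30_items, 2))  # 0.86
```

With a length factor of 2 the function reduces to the prophecy formula used for the split-halves coefficient.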




Reliability and Criterion-Referenced Tests 

Remember that norm-referenced tests are designed to maximize 
the range of scores. Discrimination is important. To discriminate 
among a group of students, a test needs to be designed so that the 
scores are spread out. You cannot discriminate among students if 
all of them obtain the same test score. 

Criterion-referenced test items link student performance to a 
criterion. Between-student comparisons are of little importance. 
Thus, it is very possible that most of the students will satisfactorily 
complete an item. Furthermore, if the test is held until the teacher 
feels the students are ready to show mastery of the criterion 
performances, it is likely that most of the class will obtain scores at 
or near the perfect level. In any event, the range of scores for a 
properly constructed criterion-referenced test will be seriously 
restricted when compared to a norm-referenced test of equal length. 
The restricted range of scores for this test will diminish the 
reliability coefficient, if it is computed by the usual techniques. 

The solution to the problem is simple. Don't compute reliability 
coefficients for criterion-referenced tests. A cop-out? No. For a 
criterion-referenced test the concept of "total score" is kind of silly 
anyhow. A criterion-referenced test of 10 items is based on 10 (or 
less) criteria. The proper way to report such test results is to tell 
each student which criteria have been mastered and which have 
not. If all 10 items reflect a given behavioral objective ("The stu- 
dent shall answer 9 of 10 long division problems accurately"), then 
you simply report "Yes, he did" or "No, he didn't" and a total score 
is unnecessary. 
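
Reporting by criterion rather than by total score might be sketched like this in Python. The function name `mastery_report`, the dictionary layout, and the student's item scores are invented for illustration; only the "9 of 10" long division objective comes from the text.

```python
def mastery_report(item_results, required_correct):
    """
    item_results: dict mapping an objective to a list of 0/1 item scores.
    required_correct: dict mapping the same objective to the number of
                      items that must be correct to show mastery.
    """
    report = {}
    for objective, scores in item_results.items():
        met = sum(scores) >= required_correct[objective]
        report[objective] = "Yes, he did" if met else "No, he didn't"
    return report

results = {"long division": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]}  # invented scores
criteria = {"long division": 9}                              # "9 of 10"
print(mastery_report(results, criteria))
# {'long division': 'Yes, he did'}
```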

Reliability refers to precision of measurement — but that is pre- 
cisely what the behavioral objective game is all about. Objectives 
stated in behavioral terms are supposed to be unambiguous and 
measurable. That's the same as saying "precise" and precision is 
what reliability is all about. Thus, to test the reliability of a 
criterion-referenced test, check to see that the criteria are 
properly stated. Review chapter 2, if necessary. 



A Final Note on Correcting Limited 
Reliability Coefficients 

The maximum value for any correlation coefficient is 1.00. This is 
also the maximum for any reliability coefficient. The minimum 
value is -1.00. Now look at the results of Test A in the exercise. 
The rankings cannot be improved upon. The second testing did not 
result in any position changes on the list. This seems like it should 
yield perfect correlation (r = 1.00) but, in fact, the value is some- 
what less than 1.00. In other words, for these two lists of scores, 
there is no way for the correlation to reach 1.00. The correlation 
is, in fact, only 0.70. This leads to interpretation problems. 

If you report that your reliability coefficient is 0.70, most people 
will shrug and say something like, "Well, that's probably all right, 
but it's sure nothing to get very excited about." They won't realize 
that you have gotten the highest possible correlation for these two 
lists of scores. Since most people interpret correlation coefficients as 
having a range from +1.00 to -1.00, you might present a corrected 
reliability (or correlation) coefficient instead. To correct the ob- 
tained coefficient, divide the obtained value by the maximum pos- 
sible value. 



$$ r_{\text{corrected}} = \frac{r_{\text{obtained}}}{r_{\text{maximum}}} $$

Thus, for Test A, the corrected coefficient is 

$$ r_{\text{corrected}} = 0.70 / 0.70 = 1.00 $$

which is a more accurate representation of what happened. The 
scores actually obtained could not possibly have been rearranged in 
any way to obtain a higher coefficient. 

To find a corrected coefficient in cases where the range of scores 
is clearly restricted, follow these rules: 

1. Find the actual or obtained coefficient in the manner pre- 
sented earlier. 

2. Find the maximum possible coefficient. To do this, each 
column of scores must be arrayed from largest to smallest. 
Then compute the correlation coefficient for these two columns. 
One note: If the obtained coefficient is negative, array one 
column from largest to smallest, and the other from smallest to 
largest. This will give you the maximum negative coefficient 
possible. 

3. Find the corrected coefficient. A caution: Any time you report 
a corrected coefficient, you must make sure your audience 
knows that you have corrected the coefficient to have a range 
from +1.00 to -1.00. 
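
The three rules can be sketched in Python as follows. The function names are invented, and the scores are the seven pre-test and post-test pairs from this chapter's worked example. One assumption is flagged in the code: for a negative obtained coefficient, the sketch divides by the size of the limiting value so that the corrected coefficient keeps its negative sign, a detail the rules do not spell out.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation using N-divisor standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    contributions = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return contributions / (n * sd_x * sd_y)

def corrected_coefficient(xs, ys):
    """Return (obtained r, limiting r, corrected r) for two score lists."""
    r_obtained = pearson_r(xs, ys)
    xs_sorted = sorted(xs, reverse=True)
    # Positive r: both columns run largest to smallest.
    # Negative r: the second column runs smallest to largest instead.
    ys_sorted = sorted(ys, reverse=(r_obtained >= 0))
    r_limit = pearson_r(xs_sorted, ys_sorted)
    # Assumption: dividing by the size of the limit keeps the sign of r_obtained.
    return r_obtained, r_limit, r_obtained / abs(r_limit)

pre = [10, 6, 8, 7, 7, 9, 9]
post = [10, 8, 9, 7, 9, 9, 11]
r_obt, r_max, r_corr = corrected_coefficient(pre, post)
print(round(r_obt, 2), round(r_max, 2), round(r_corr, 2))  # 0.73 0.91 0.8
```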




Here are sample test scores from seven people with a restricted 
range test. Follow the computation through. 



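STUDENT    Pre-Test    Post-Test

Tricia        10          10
Clint          6           8
Nancy          8           9
Peter          7           7
Lynn           7           9
Hiram          9           9
Don            9          11

$$ \bar{X}_{\text{pre}} = 8 \qquad \bar{X}_{\text{post}} = 9 $$

$$ s_{\text{pre}} = \sqrt{\frac{12}{7}} \qquad s_{\text{post}} = \sqrt{\frac{10}{7}} \qquad r_{\text{obtained}} = 0.73 $$

Scores rearranged for $r_{\text{maximum}}$

Pre-Test    Post-Test

   10          11
    9          10
    9           9
    8           9
    7           9
    7           8
    6           7

$$ r_{\text{maximum}} = \frac{10}{7 \times \sqrt{\frac{12}{7}} \times \sqrt{\frac{10}{7}}} = 0.91 $$

$$ r_{\text{corrected}} = \frac{0.73}{0.91} = 0.80 $$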



Question 

8. Find the corrected reliability coefficient for the following sample test 
scores from ten people with a restricted range test. 



STUDENT    TEST A    TEST B

Nancy         8        10
Joey          9         9
Henry         9        11
Reed          7         6
Susan         8         8
Ellen         6        10
Bill         11         9
Yolanda       7        10
Nadine        8         9
Ted           7         8




Summary 

A test is a measuring device. Precision is a necessary ingredient in 
any measurement. Reliability is the testing term used to denote the 
test's precision. Another concept, more important than reliability, 
is test validity. A valid test is one which measures what it is sup- 
posed to measure. Obviously, a test cannot be valid without having 
an acceptable amount of precision (reliability). Validity is the topic 
of chapter 9. 

The split-halves technique for computing reliability was pre- 
sented first. This is a procedure which is most amenable to class- 
room use. The procedure depends on computing a correlation 
coefficient between the two half-test scores (odd numbers and even 
numbers) for each of the students. The computational procedures 
have been detailed in the chapter. 

Three other types of reliability coefficients were also described. 
These included the Kuder-Richardson formulae, alternate forms 
procedure, and the practice of testing, then retesting, the same 
group of students. The Kuder-Richardson and split-halves coeffi- 
cients require one test which is administered only one time. The 
test-retest procedure also requires only one test, but the measure 
must be administered twice to the same group of students. The 
alternate forms coefficient depends on two closely related measures 
which are both administered to the same group of students. Some 
suggestions have been given regarding the uses and misuses of a 
test-retest format in assessing changes over time. 

A test has many reliability coefficients. The size of the coefficient 
is, to a certain extent, a function of the group tested. Knowledge 
about the group tested is mandatory to the intelligent interpreta- 
tion of a reliability coefficient. Some of the key factors related to 
the size of the reliability coefficient have been detailed in the 
chapter. Homogeneity of the group tested and range of scores 
within the group are two of the important factors. 

Reliability coefficients depend on the range of scores in the 
tested group — the wider the range, the higher the coefficient, all 
other things being equal. Criterion-referenced measures are not 
necessarily designed to provide a wide range of scores. In fact, cri- 
terion-referenced measures frequently provide very narrow ranges 
of performance. The notion of reliability is therefore difficult to 
interpret and apply to these kinds of tests. 




The chapter ends with a description of a procedure which can 
be used to correct the size of reliability coefficients which are 
limited by the range of scores on a particular measure. The use of 
a corrected coefficient in any report, however, must be accompanied 
by a clear statement to the readers that this procedure has been 
followed. 



Validity: Are We At the Right Place? 



A motorist, driving west on Interstate 80 through Illinois, charts a 
path to Springfield. He leaves the controlled access highway at 
Iowa City, drives north through Iowa to the Minnesota border and 
on into Springfield, Minnesota. The only problem was that he 
meant to be in Springfield, Missouri! His movements during the 
trip were not random. Each turn he made along the way would 
have been repeated in a test/re-test situation. The turns themselves 
were internally consistent, since each led to the final destination. 
In short, his movements were reliable — but they didn't take him 
where he meant to go! They were not valid. 

All measures are written with a purpose in mind. A measure is 
valid if it fulfills the purpose for which it was intended. 

Generally speaking, the measurement situation is like the dia- 
gram in figure 6. A one-to-one relationship does not occur between 
all possible behaviors shown by the blob at the left and the test 
items. That is, an item is not written for each behavior. The 
measure (or test) is just a sample from the domain.¹ 

The concept of validity is illustrated by the broad arrow shown 
above the left-hand blob. Validity requires an independent check 
on the accuracy of the results. Before a measure can be valid, it 
must be reliable. Reliability suggests precision. If a measure is 
totally imprecise, it cannot measure anything accurately. Validity 
will be impossible. 

[Figure 6. VALIDITY CHECK. An independent check on the accuracy 
of the measure's results. Diagram labels: VALIDITY CHECK (An 
Independent Assessment); THE MEASURE (Test); DOMAIN OF BEHAVIORS 
TO BE ASSESSED (Measured).] 

1. Although, as pointed out in chapter 4, a criterion-referenced test can be 
constructed in some cases where every important and desired behavior is 
measured by one or more items. 

Reliability is necessary for validity, but a reliable test is not 
always valid. In fact, the errant driver was perfectly reliable in 
reaching the wrong destination. Precision is important, to be sure, 
but validity is the key concept. What good is a measure if it doesn't 
do what it's supposed to? 

In this chapter, three different validity concepts (content,
criterion-related, and construct), describing manners of measuring
validity, will be introduced. A section describing the expectancy
table, a commonly used technique for presenting validity results,
will follow. The standard error of measurement is another common 
technique for interpreting or making decisions about test scores. 
The chapter ends with a series of suggestions and cautions appro- 
priate to the interpretation of validity coefficients. 

Briefly, the three validity concepts are: 

1. Content Validity, in which a systematic analysis of the domain
of behaviors to be covered by the measure is carried out. The
outcome of this analysis is contrasted with the test content. 




2. Criterion-Related Validity, in which results from the measure 
are compared with the results from a trusted and independently 
obtained measure of the same domain of behaviors. 

3. Construct Validity, in which the results of the measure are 
independently tested against what "reasonable people" would ex- 
pect to occur. Some illustrative examples will be given later in this 
chapter. 



Content Validity 

Content validity requires a systematic analysis of the domain to be 
measured. That seems like a strange idea since it's a task required 
of the test writer before developing the items. Authors of standard- 
ized achievement batteries begin by outlining the domain to be 
covered. Then they weight topics according to relative importance. 
The items are built to fit the outline. 

Does it follow, then, that any achievement test constructed in 
this manner will have content validity? The answer is an emphatic 
"No!" Why not? 

Look again at the diagram in figure 6. The validity check arrow 
implies an independent audit of the content. The idea is that the 
validation of the test's value will be from an outside source. Asking 
for an independent check of content validity with standardized 
achievement tests, even when the procedures involved in the valid- 
ity check are similar to the test construction procedures, sounds 
like an expression of "no confidence" in the test writer. The 
reasoning actually goes much deeper and gets at some important 
concepts related to interpreting validity coefficients.

A test — a measure of some kind — is not valid or invalid. A 
validity label can only be attached after a use has been attempted.
"Use" implies that some people will be involved. In fact, a whole 
variety of groups of people might respond to the items in the 
measure. 

Thus, you should never be told: "This test is valid. . ." or "This 
test is not valid . . ." because neither is complete. The validity 
statement should also tell you the purpose for which it was used 
and the people involved in the validity check. 

Think, for example, of the different uses which could be made 
of a test to measure problem-solving performance in junior high 
school science and math. The measure might be used to (1) identify 
high performers for placement in a special accelerated program;
(2) identify low performers for placement in a special tutorial
section; (3) assess the performance of science teachers via the 
performance of their students; (4) assign course grades after a unit 
on problem solving; (5) audit yearly performance levels in a particular
district; (6) compare the effectiveness of different text
series approaches to the question of problem solving.

You probably can think of more. Can you see, though, how any 
statement about validity must be linked to a statement of use? 
Clearly, the problem-solving test might be valid in some of the 
situations above — but invalid in others. 

But use is just one part; the specification of the characteristics 
of the people involved in this use is another. Again, think of using 
the problem-solving test with such groups as: (1) a random sample 
of middle school students in the United States; (2) a group of 
recent immigrants into the United States; (3) middle school stu- 
dents from predominantly rural communities; (4) students identi- 
fied by teachers as having high science potential; (5) teachers of 
middle school science; (6) students who are part of a particular and 
identifiable minority group. 

Again, doesn't it seem obvious that any statement about validity 
must also be accompanied by some statement about the people 
involved? The test might be satisfactory for some of the above 
groups, but hopelessly inadequate for others. 

To analyze the content validity of a test, you begin with the 
group to be measured and the purpose of the testing with that 
group. These procedures should be followed: 

a. Make a list of the major instructional devices and the content 
of the devices which will be used with the people to be tested. 
Wherever possible, include the expected behaviors from the indi- 
viduals in the group. 

b. Do some sort of subjective weighting of the relative impor- 
tance of the topics or behaviors. If none of the particular behaviors 
seems to take a dominant position, then the test items should have 
a uniform distribution over the list. If, in your particular program, 
certain information, skills, or behaviors are more important than 
others, then the item selection for the test should be weighted 
accordingly. 
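Step (b) can be turned into simple arithmetic once the subjective weights are written down. The sketch below is only an illustration: the topic names, the weights, and the 40-item test length are hypothetical, and the rounding rule (hand leftover items to the topics with the largest fractional share) is one reasonable choice among several.

```python
def allocate_items(weights, total_items):
    """Split total_items across topics in proportion to their subjective weights."""
    total_weight = sum(weights.values())
    exact = {topic: total_items * w / total_weight for topic, w in weights.items()}
    # Start from the whole-number part of each topic's share.
    allocation = {topic: int(share) for topic, share in exact.items()}
    leftover = total_items - sum(allocation.values())
    # Give any remaining items to the topics with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda t: exact[t] - allocation[t], reverse=True)
    for topic in by_remainder[:leftover]:
        allocation[topic] += 1
    return allocation


# Hypothetical program in which problem solving matters most.
WEIGHTED = {"computation": 3, "problem solving": 5, "measurement": 2}
# Hypothetical program in which no behavior takes a dominant position.
UNIFORM = {"computation": 1, "problem solving": 1, "measurement": 1}

weighted_plan = allocate_items(WEIGHTED, 40)
uniform_plan = allocate_items(UNIFORM, 40)
print(weighted_plan)
print(uniform_plan)
```

Comparing a plan like this with the number of items a published test actually devotes to each topic is one way to organize the coverage part of the check. The judgment about whether the agreement is close enough remains subjective.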

The test publisher starts from the other direction. He tries to 
write a test which fits the maximum number of learning situations. 
Usually this means that standardized achievement tests are written 
to conform to the "typical" classroom in the United States. The 
test may be constructed to have content validity for the typical
classroom, but in cases where the students, teachers, objectives, or
instructional program differ from "typical," the content validity 
should be rechecked using the procedures described above. 

Recall from chapter 5 that the writer of a standardized test is 
somewhat constricted in the form of item he must prepare. For 
scoring convenience, recognition formats (multiple choice, true- 
false, matching) are preferred over supply-type items (completion, 
essay, or problem items). Check the format of the test against the 
list you make for (a) above. Do you agree that recognition items 
are satisfactory? Did you have certain objectives which included 
words like "recall" or "write a paragraph about" or "do problems, 
showing work"? A check of content validity should focus on both 
the content and the format of the items. 

The results which come from a content validity check are not 
numerical. The report of results contains comparisons of the test 
coverage, format, and the purposes for which the test is being used. 
A high amount of agreement between purposes and test coverage 
indicates high content validity. 

How is a criterion-referenced test constructed? You recall that 
first the behavioral objectives of the program are written and then 
items are devised to measure the attainment of these objectives. 
Checking the content validity of a criterion-referenced test is an 
easy task. Again, you start with the particular use you have in 
mind for the testing. You construct the list of major objectives and 
instructional devices. Now, however, you merely need to determine 
the congruence between your list and the list of behavioral objec- 
tives. A well-constructed criterion-referenced test will, by defi- 
nition, have content validity. The procedures for constructing the 
test and checking the validity are identical. Criterion-referenced 
tests go with the concept of content validity, which seems strange 
since a comparison of titles would put criterion-referenced tests 
with criterion-related validity.

A content validity check with typical performance measures 
(attitude, interest, or personality measures) is difficult. It implies 
a comparison between some instructional program or activities and 
the content of a test. When a measure concentrates on an interest 
or an attitude, it usually is not built around some particular set of 
activities. In typical performance measures, the domain of specific 
behaviors expected is difficult to list. Since no list of instructional 
activities is available and the list of behavioral outcomes is difficult, 
there is nothing to which to compare the content of the measure. 
Measuring content validity is impossible in such cases. 




Questions 

1. Make a list of the five most important behaviors which, in your 
opinion, should come from a mathematics program by the time the 
student has finished sixth grade. Then, find a copy of the math 
section from a standardized achievement test for the sixth-grade 
level. Write a statement about the content validity of the test as it 
relates to your list of expected behaviors. Comment on both coverage 
and format. 

2. Is it possible that a standardized sixth-grade math test could have 
high content validity for District A, but low validity for District B? 
Discuss the implications of this statement. 



Criterion-Related Validity 

Criterion-related validity, in sharp contrast with content validity, 
requires two numerical measures. The first comes from the test 
score itself. The second is from an independent measure of the 
criterion. Look at figure 6 and think through this example: 

The Gouge-em-Good School of Pediatrics has developed a Pedia- 
tricians' Screening Test to be administered to college graduates 
with science majors who are interested in the field of pediatrics. The 
test is administered to a group of applicants to the school. Two 
criterion-related validity studies are planned, as follows: 

a. The well-accepted screening test of the pediatrician's profes- 
sional association is also administered to the group of applicants. 

b. When the current applicants finish old Gouge-em-Good, their 
final grade point averages will be correlated with the scores they 
received on the Pediatricians' Screening Test. 

Both of the validity studies are examples of criterion-related 
validity. The first measure will be administered at the same time 
as the screening test. Such a validity study is also called concurrent 
validity — concurrent meaning "at the same time." In the second 
measure, the scores were used to make a prediction about how the 
applicants would do in the program. Higher grade point averages in 
the school are apparently assumed to mean success in the school. 
The screening test scores are used to predict how much success 
each applicant will have in the school. This kind of criterion-related 
validity study is also called predictive validity. Thus, the two
general classes within criterion-related validity are concurrent
validity and predictive validity.

To compute the actual coefficients, the Pearson Product Moment 
Correlation Coefficient is used once again. As a review, work 
through these two questions. Refer to chapter 8 for help. 



Questions 
3. Compute the criterion validity (concurrent).

                Score on             Score on
                Pediatrician's       Well-Accepted
APPLICANT       Screening Test       Screening Test

Peter           24                   100
Lois            26                   105
Sandra          23                   103
Lynn            20                    95
Paul            22                    97


4. Compute the criterion validity (predictive).

                Score on
                Pediatrician's       Final Grade
APPLICANT       Screening Test       Point Average

Peter           24                   3.6
Lois            26                   3.0
Sandra          23                   2.4
Lynn            20                   2.7
Paul            22                   3.3
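If you would rather check your hand computation by machine, the two coefficients can be sketched in a few lines. The function below uses the usual deviation-score form of the Pearson coefficient, which may be arranged differently from the formula in chapter 8. The variable names are made up for the illustration, and the second criterion column is read here as final grade point averages, an assumption based on study (b).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Scores for Peter, Lois, Sandra, Lynn, and Paul, in that order.
screening = [24, 26, 23, 20, 22]          # Pediatricians' Screening Test
well_accepted = [100, 105, 103, 95, 97]   # criterion for question 3
final_gpa = [3.6, 3.0, 2.4, 2.7, 3.3]     # criterion for question 4 (assumed to be GPA)

concurrent = pearson_r(screening, well_accepted)
predictive = pearson_r(screening, final_gpa)
print(f"concurrent validity coefficient: {concurrent:.2f}")
print(f"predictive validity coefficient: {predictive:.2f}")
```

With only five applicants, either coefficient could shift a good deal if one more person were added, so treat the results as practice in the arithmetic and not as evidence about any real test.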



Remember that all validity studies require an independent mea- 
sure of the criterion. In the first case, a second and "well-accepted" 
test is the independent measure. In the second case, it was an inde- 
pendent measure of the actual performance of the applicants.

Why would one try to develop the new measure when a "well- 
accepted" one was already available? The primary answer is con- 
venience. Suppose the "well-accepted" measure involves two days 
of careful interviewing by practicing pediatricians. Given the usual 
salaries of pediatricians, the expense of such a procedure is prohibitive.
If the new measure can replicate the results of the two-day
interview in a satisfactory manner, it will be more convenient to 
use. A new measure is used to replace an old one if the new one 
can be shown to be quicker, cheaper, or simpler. 

The two criteria used in the pediatrician example exemplify two 
major classes of criterion-related studies. In the first category, the 
criteria are other measures of the same criterion variable which 
have become accepted as valid. The reasoning is something like 
this: If Test A measures the criterion in an acceptable and valid 
manner, and if Test B gives the same results as Test A; then Test 
B is as good as Test A. When IQ testing was first begun, the instru- 
ments were usually individually administered by trained psycholo- 
gists. The cost of each testing was substantial. But IQ group tests 
have been shown to correlate in a satisfactory manner with indi- 
vidualized tests. Thus, group tests are now widely used. Their 
validity is shown primarily through correlation studies with indi- 
vidualized tests. 

Academic criteria, such as the pediatricians' grade point aver- 
ages, are frequently used in criterion-related validity studies. Other 
types of academic criteria are high school or college rank, highest 
degree completed, grades in a particular course, or teacher ratings. 
Actual performance on a previously predicted task is another 
important category of criterion measures. For example, if the 
screening test is for prospective Morse code operators, the criterion 
measure would be performance of the trainees after the training 
program. A typing aptitude test is designed to screen for prospec- 
tive good typists. The most logical criterion measure would be 
typing performance after training. Industry uses a wide variety of 
screening tests for job applicants in areas like clerical work, pro- 
grammers, machine operators, or sales occupations. The perform- 
ance criterion measure for each task would be success in the job. 



Questions 

5. Think of two other criterion measures which might be used for the 
Pediatrician Screening Test. Remember that the criterion measure 
must have at least two levels on the scale (such as success or failure) . 

6. List five different criterion measures which might be used for a Test 
of Teaching Proficiency which is administered to all applicants in 
the education department in a certain university. 




You'll find that validity coefficients tend to be low in comparison
to reliability coefficients. A close look at the criterion-related valid- 
ity process will show that inaccuracies can occur at three separate 
points: 

a. The scores from the measure in question (the predictor) will 
have an error component. 

b. The lack of congruency between the predictor coverage and 
the coverage of the criterion measure adds an error component. 

c. The scores from the criterion measure have an error com- 
ponent. 

How can errors occur in the actual measure? Every test has 
some amount of error involved. Usually, a test is a sample of a long 
list of expected behaviors. When the television networks predict 
the outcome of certain elections, they base the predictions on small 
samples. The predictions are accurate, but not perfectly so. The 
commentators will not predict races which are close because they 
know the predictions, based on the sample, contain a certain 
margin of error. A sample is an indicator of what is on the larger 
list; but it is not a replication. A sample must be assumed to con- 
tain some error. 

Thus, the test will contain a sampling error. Errors in results can 
occur for other reasons, though. The test may be selectively biased 
for or against some group like one of the two sexes, an ethnic group, 
or a racial group. Individuals sometimes react in a non-uniform 
manner based on short-term problems (an argument before the 
test, too little sleep, or a too-warm room). All of these contribute 
to errors in the predictor test. 

A test used to predict success in a pediatrics training program 
obviously is used before any of the applicants have been in such a 
program. The domain of behaviors sampled by the prediction test 
is different from the domain of behaviors being predicted. The dis- 
crepancies are great. A predictive test for Morse code operators 
might include a few items like dit-dah, dit-dit-dah, dah where the 
applicant has been given a brief training program to learn these 
three symbols. The three would be spaced a few seconds apart. A 
trained operator must know around forty symbols and copy up to 
one hundred characters per minute. The predictor test samples 
from a very limited domain when compared to the actual domain 
being predicted. Error occurs here as well. 

Finally, a recheck of the different criterion measures mentioned 
earlier will suggest errors. Academic measures of accomplishment,
such as grades, grade point average, class ranks, or teacher ratings,
all have commonly recognized error components. Do the best stu- 
dents always receive the best grades? Is it possible for a teacher to 
rank students without certain biases? As a matter of fact, ratings 
of various kinds are frequently used as an independent measure in 
criterion-related validity studies. A test to predict clerical accuracy 
might use, as an independent measure, ratings by a supervisor of 
the people tested earlier. A test to predict teaching proficiency 
might be validated by using principals' ratings. The inevitable 
biases which come from human interaction will tend to make these 
later ratings less than precise. In addition, the ratings are often 
made even though the supervisor or principal lacks certain impor- 
tant information about the person being rated. 

In short, the measurement of the validity coefficient is somewhat
like measuring the distance between walls A and B with the tape
measure shown in figure 7. You don't know exactly where to hold
the tape to start with, indicating the error in the predictor. You
don't know where to place the tape between the walls, since the
distance is variable, indicating the error due to differences between
predictor domain and criterion domain. You don't know exactly
how to read the distance anyhow, since the units are unequal and
not marked by a specific line on the tape, indicating error in the
criterion measure. No wonder validity coefficients tend to be lower
than reliability coefficients.

[Figure 7: a tape measure between WALL A and WALL B, labeled START, with scale marks 1, 2, and 3.]

Figure 7
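A small simulation can make the three error sources (a), (b), and (c) visible. Everything in the sketch below is invented for the illustration: the 2,000 simulated people, the normal distributions, and the size of each error are assumptions, not figures from any real validity study.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


rng = random.Random(0)  # fixed seed so the run can be repeated
n = 2000

# Hypothetical error-free standing on the predictor domain and on the criterion
# domain. The domains overlap but are not identical (error source b).
predictor_true = [rng.gauss(0, 1) for _ in range(n)]
criterion_true = [0.8 * t + 0.6 * rng.gauss(0, 1) for t in predictor_true]

# Observed scores add measurement error on each side (error sources a and c).
predictor_obs = [t + rng.gauss(0, 0.5) for t in predictor_true]
criterion_obs = [t + rng.gauss(0, 0.5) for t in criterion_true]

r_domains_only = pearson_r(predictor_true, criterion_true)
r_predictor_error = pearson_r(predictor_obs, criterion_true)
r_all_errors = pearson_r(predictor_obs, criterion_obs)

print(f"domain mismatch only:        {r_domains_only:.2f}")
print(f"plus error in the predictor: {r_predictor_error:.2f}")
print(f"plus error in the criterion: {r_all_errors:.2f}")
```

Each added source of error pulls the coefficient down a little further, which is the point of the tape measure analogy.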

In chapter 9, the point was made that the size of a correlation 
coefficient is affected by the range of scores on the two measures. 
If the range of scores is restricted, the coefficient will be lowered. 
In most prediction situations, the range will be diminished between 
the time of the two measures. For example, suppose the predictor 
is an academic aptitude test used to help colleges decide who will 
be enrolled in certain programs. Will all who originally take the test 
enter the college and complete the program? Obviously not, and 
those who do not complete the program will be anything but a 
random sample from the original group. In fact, the college may 
use the scores on the predictor purposefully to eliminate the lowest 
scorers. This restricts the range of responses on the criterion mea- 
sure and reduces the size of the validity coefficient possible. The 
frequently reduced range of scores in the criterion measure is still 
another reason for relatively low validity coefficients. 
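The effect of a restricted range can also be seen in a simulation. The sketch below is built on assumptions chosen only for illustration: 2,000 simulated applicants, an aptitude test with mean 50 and standard deviation 10, a made-up criterion that depends partly on aptitude and partly on chance, and a college that admits only the top half of the scorers.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


rng = random.Random(1)  # fixed seed so the run can be repeated
n = 2000

# Hypothetical aptitude scores and a criterion that is part aptitude, part chance.
aptitude = [rng.gauss(50, 10) for _ in range(n)]
criterion = [0.05 * a + rng.gauss(0, 0.5) for a in aptitude]

# The college uses the predictor to eliminate the lower half of the scorers.
cutoff = sorted(aptitude)[n // 2]
admitted = [(a, c) for a, c in zip(aptitude, criterion) if a >= cutoff]

r_all_applicants = pearson_r(aptitude, criterion)
r_admitted_only = pearson_r([a for a, _ in admitted], [c for _, c in admitted])

print(f"all applicants:       {r_all_applicants:.2f}")
print(f"admitted groups only: {r_admitted_only:.2f}")
```

The relationship between aptitude and the criterion is the same for every simulated person, yet the coefficient computed from the admitted group alone comes out lower, simply because the low scorers are no longer in the data.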



Questions 

7. Six clerical aptitude scores have been obtained before the six named 
employees began work. Three different supervisors rated the em- 
ployees after six months. Supervisor A rated on a two-point scale 
(good or bad) ; Supervisor B used a three-point scale (good, average, 
bad) ; and Supervisor C used a five-point scale (very good, good, 
average, bad, very bad) . 

Compute the predictive validity coefficient for each supervisor's 
ratings. 

                CLERICAL APTITUDE     SUPERVISOR RATINGS
EMPLOYEE        SCORE                 A     B     C

Carolyn D.      20                    1     2     4
Lynn A.         18                    1     2     3
Laura C.        17                    1     1     2
Robin F.        17                    0     1     2
Marge G.        16                    0     0     1
Sandy M.        14                    0     0     0

8. Note that the rankings of the three supervisors are identical. What 
can you say about the trend of the correlation coefficients as the scale 
becomes broader? 




Construct Validity 

Construct validity checks are based on the "reasonable man (or 
woman)" theory. These examples should help clarify the term: 

Coach Smutz reported that three of his five starters are All-Americans
and that his top two reserves could make the starting five of
any other team in the conference. At the end of the season, his team 
had won only 10 percent of their games. 

For this example, the original "measure" was Coach Smutz's 
evaluation of his players. The independent measure was the per- 
formance of his team against the conference opponents. A "reason- 
able man" would conclude that his original measure was not too 
accurate. The original measure was invalid. 

A questionnaire to determine attitude toward racial integration was 
administered to a random sample from a neighborhood. The scores 
indicated nearly unanimous approval of racial integration on the 
part of the neighborhood residents. In the year which followed the 
administration of the questionnaire, three families from minority 
groups purchased homes and moved into the neighborhood without 
any problems. 

A "reasonable man" would predict, based on the outcome of the 
questionnaire, that minority families could move into the neighbor- 
hood without problems. The independent measure was the actual 
integration of the neighborhood. The evidence does not contradict 
the "reasonable man's" hypothesis. 

A questionnaire is constructed for prospective orthodontists to de- 
termine reasons for entering this profession. The desire for mone- 
tary awards is ranked well below labels like "service," "challenge,"
and "research interest." A federal bureau later reports the income 
of orthodontists to be in the upper 1 percent of incomes for the 
population at large. 

A "reasonable man" would probably predict, based on the ques- 
tionnaire results, that the income level of the orthodontists would 
not be so high. The first measure was the statements of the pro- 
spective orthodontists. The independent measure was the actual 
income information. The evidence contradicts what a "reasonable 
man" would predict. 




One hundred trainees in a television repair school are enrolled in 
twenty courses over a six-month period. Each course is evaluated 
on an A, B, C, D, F scale. At the end of the six months, the grade 
point averages are computed. Three years later, the grade point 
averages for the 100 are correlated with their yearly income derived 
from repairing televisions. The correlation is 0.60. 

A "reasonable man" would figure that the best repairmen would
receive the highest grades and end up with the highest average. 
Whatever traits led them to the highest averages, the reasoning 
continues, would also lead to the highest salaries. The correlation 
between grades and salary is fairly high, considering the errors 
which can result in validation coefficients. The evidence certainly 
does not contradict the "reasonable man" hypothesis. 

The examples lead to these thoughts about construct validity: 

1. The results from any measure should be reasonable in the face 
of your common sense. A test which shows one of your best stu- 
dents doing poorly should not be accepted without challenge. 

2. A construct is something people use to sort out the chaos of 
human interactions. Instead of saying, "Old Bob doesn't like to 
talk to people but does like to walk in the woods alone and is an 
accountant," people say, "Old Bob is sort of introverted." Introver- 
sion is a construct which implies other statements about Bob. 
Construct validity, then, is a check to see if the things that ought 
to happen together do happen together; or, in the negative sense, to 
see if the things that ought not happen together do not happen 
together. 

3. In two of the examples, the construct validation measure con- 
tradicted the original measurement. Coach Smutz's statement and 
the orthodontists' questionnaire results were both contradicted by 
evidence. The evidence causes the validity of each statement to be 
questioned. 

4. On the other hand, integration of the neighborhood and the 
income data on the television repairmen did not contradict the 
earlier measures. Note that "did not contradict" was used in 
the statement and not "supported." The difference may seem 
unimportant, but suggests this important distinction between sup- 
porting and nonsupporting independent measures. If a single 
independent measure should, to a reasonable man, support an 
earlier one, but does not, the earlier one is brought into serious 
question. But, if the single independent measure does not contradict
the earlier one, the earlier one is still not proven. Construct
validity is based on an accumulation of independent measures
which do not contradict the earlier results. 

5. Content validity, you will recall, ended up with a subjective 
statement about the extent to which the test matched the domain 
of objectives being measured. Criterion-related validity is a numerical
comparison of one list of scores with an independently obtained
list of scores. Construct validity can be either a subjective state- 
ment or a numerical measure. In the case of the coach and his 
basketball team, the first measure was subjective (a statement) 
and the second was numerical (the team's won-lost record). In the 
case of the orthodontists, the first statement was subjective (ques- 
tionnaire response which possibly could be translated into numeri- 
cal scores) and the second measure was a statement (which also 
could be translated into numbers). The TV repairmen example 
involved two lists of numbers used to find a correlation coefficient. 
Thus, evidence on construct validity can appear as a subjective 
statement or as a correlation coefficient. 



Questions 

9. For each of the following, designate one of these four: Content
Validity, Criterion-Related Concurrent, Criterion-Related
Predictive, or Construct.

a. Ten of the last twelve lawyers to be disbarred in Illinois were
graduated from the upper 10 percent of the XYZ Correspondence Law
School, whose owners advertise it as an excellent school.

b. The behavioral objectives upon which the Chen Achievement Test
was based are shown to correspond to the general goals of the Miss
Dirkes Little School for Boys.

c. The Wyoming Test of Reading Readiness is shown to give results
which correlate highly with three nationally published readiness
tests.

10. A graduate school which provides certification credit for public 
school principals constructs a series of performance measures as a 
final exam before graduation. Outline a validity study for the test 
based on 

a. Content validity 

b. Criterion-related validity (predictive)

c. Construct validity. 




11. For the test described in problem 10, create: 

a. Two examples of cases where a construct validity study would 
give negative evidence. 

b. Two different examples where a construct validity study would
give positive evidence. 



Interpreting Validity Studies: The Expectancy Table 

The expectancy table is a technique which is frequently used to 
report the results of validity studies. As the word "expectancy" 
indicates, information about probable future performance is trans- 
mitted via the table. One measure (the Predictor) is obtained from 
a person, and on the basis of this result, the person can be told the 
expected performance level on the second measure (the Criterion). 
Besides being used to report validity results, expectancy tables are 
used in reliability studies or simply to show the relationship be- 
tween two measures. You'll run across expectancy tables frequently 
enough to justify spending some time at this point understanding 
the construction procedures. This is going to be a "learn-by-doing" 
section. Get a piece of paper and a pencil so you can work through 
the steps as directed. 

BACKGROUND: An expectancy table is based on information 
on both measures for at least one group of students. For the first 
group, both measures (Predictor and Criterion) must be adminis- 
tered. After the information is available from this group, later 
individuals need only take the first measure (the Predictor) and an 
expected score can be assigned for the second. The second score will 
be phrased as a probability statement. For example: "Since you 
scored 33 on the Predictor, we expect a score of about 5 on the 
second measure. . ." or, "Your score of 33 on the Predictor indicates 
that the chances are 9 out of 10 you will succeed in this program of 
study." 

In this problem, the predictor is an individually administered 
battery of 75 items for 36-month-old children. The goal is an 
attempt at early prediction of a cognition problem in children at 
age 60 months so that ameliorative action can be taken before that 
time. The Criterion Measure is an observation of the children in a 
problem-solving situation. The Criterion is a score on a four-point 
scale, where a one indicates complete failure at the task, a four
complete success, and a two or three stands for the gradations
between the two extremes. The process of establishing the expect- 
ancy table involved first administering the test to 80 children at 36 
months. Two years later, when the children had reached 60 months, 
each was observed in the problem-solving situation. Obviously, giv- 
ing expectancy figures for any of the first 80 children is silly, since 
a probability statement about expected result is a poor substitute 
for a statement of actual result. But once the expectancy table is 
completed, statements about a new crop of 36-month-old children 
will be possible. Table 5 shows the results. 

Table 5
Expectancy Table Raw Data from 80 Children

Child   Predictor   Criterion      Child   Predictor   Criterion
No.     (36 mon)    (60 mon)       No.     (36 mon)    (60 mon)

  1        33           1            34       47           2
  2        33           2            35       47           1
  3        34           1            36       48           2
  4        34           2            37       48           2
  5        34           1            38       48           3
  6        35           1            39       49           3
  7        35           1            40       49           2
  8        36           1            41       50           3
  9        36           1            42       50           1
 10        36           2            43       50           2
 11        37           1            44       51           2
 12        37           2            45       51           3
 13        38           2            46       52           2
 14        38           2            47       52           4
 15        38           1            48       52           3
 16        39           1            49       53           2
 17        39           2            50       53           2
 18        40           1            51       54           3
 19        40           2            52       54           2
 20        40           3            53       55           1
 21        41           2            54       55           2
 22        41           3            55       56           3
 23        42           2            56       56           3
 24        42           1            57       56           4
 25        43           2            58       57           3
 26        43           2            59       57           2
 27        44           1            60       58           4
 28        44           1            61       58           4
 29        44           1            62       59           3
 30        45           2            63       59           3
 31        45           1            64       60           2
 32        46           3            65       60           3
 33        46           3            66       60           2
 67        61           4            74       64           2
 68        61           2            75       64           3
 69        62           4            76       65           4
 70        62           4            77       65           3
 71        63           3            78       66           4
 72        63           4            79       66           4
 73        63           2            80       66           4



Table 6
Frequency Distribution of Predictor and Criterion Scores

PREDICTOR                        CRITERION SCORES
SCORE CLASS      FREQ         1       2       3       4

   - 36           10          7       3
37 - 42           14
43 - 48
49 - 54
55 - 60           14
61 - 66

Col TOTALS        80                 30              12



Copy table 6 on a separate sheet of paper. Some of the informa- 
tion is given. Your task is to complete the display. The heading 
"Predictor Score Class" lumps together six scores per line. All of
the scores up to and through 36 are in the top class, at or between 
37 and 42 in the second, and so forth. The seven hatch marks on 
the first line under Category 1 stand for the seven people (see 
table 5) who had a predictor score of 33, 34, 35, or 36 and who also 
scored a 1 on the criterion task. Three of the ten children with 
scores through 36 scored a 2 on the later criterion measure. The 10 
in the "Frequency" column on the first line indicates that there 
were 10 children with scores at 36 or less. 
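The tallying just described can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `first_ten` and `score_class` are hypothetical, and the ten (predictor, criterion) pairs are those of children 1 through 10 in table 5, so only the first line of the display is reproduced.

```python
from collections import Counter

# (predictor, criterion) pairs for children 1 through 10 of table 5.
first_ten = [(33, 1), (33, 2), (34, 1), (34, 2), (34, 1),
             (35, 1), (35, 1), (36, 1), (36, 1), (36, 2)]

def score_class(predictor):
    """Return the label of the predictor score class a score falls in."""
    if predictor <= 36:
        return "-36"                       # all scores up to and through 36
    low = 37 + 6 * ((predictor - 37) // 6)  # later classes hold six scores each
    return f"{low}-{low + 5}"

# One tally per (class, criterion score) cell, plus the frequency per class.
cells = Counter((score_class(p), c) for p, c in first_ten)
freq = Counter(score_class(p) for p, _ in first_ten)

print(freq["-36"], cells[("-36", 1)], cells[("-36", 2)])  # 10 7 3
```

The same two counters, fed with all 80 pairs, would give every line of the display.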

The table you have completed is a rough form of expectancy 
table. You can tell, for example, that no child who scored 36 or 
below was graded at levels 3 or 4 on the problem-solving test. Now
you're not sure that no child ever would score a 3 or even a 4 after 
the earlier score, but concerning a child with a 34 on the predictor 
measure, a statement such as, "It seems unlikely that he will score 
a 3 or a 4," would be appropriate. About half of the children who 
scored 61 or above were at the 4 level on the task, and about three- 
fourths at the 3 or 4 level. You could report regarding a child whose 
predictor was in the 61-66 class: "The chances are better than 
average that his performance at 60 months will be in one of the top
two levels; and it is unlikely that his score on the problem-solving 
test will be at the lowest level." 

What good is an expectancy table which allows such statements? 
The tables are only of value when a decision must be made about 
individuals or programs. For example, suppose supporting evidence 
indicated that (a) a child who scored at the lowest (or 1) level on 
the criterion measure was likely to have a very hard time in kinder- 
garten; and (b) the situation could be corrected by extensive and 
expensive intervention techniques. Now, the school cannot inter- 
vene in every case because the cost is prohibitive. But the school is 
willing to "pay the toll" where the probability of real need is high. 
Of those who scored 36 or lower on the predictor, seven out of ten 
or 70 percent later scored at the lowest level on the problem-solving 
test. For students scoring 42 or below, twelve out of twenty-four or
50 percent scored at the "1" level on the criterion measure. The 
decision about where to establish the cutoff point is a function of 
the urgency of the problem and the funds available for help, but 
the expectancy table is one technique which can be used to make 
such decisions based on data, rather than impressions. 
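The two cutoff figures quoted above (70 percent at 36 and 50 percent at 42) can be checked directly against the raw data. The sketch below is one possible way to do it, not a prescribed procedure: the lists are transcribed from table 5 in child-number order, and the function name `lowest_level_rate` is hypothetical.

```python
# Predictor (36 months) and criterion (60 months) scores for children 1-80,
# transcribed from table 5 in child-number order.
PREDICTOR = [
    33, 33, 34, 34, 34, 35, 35, 36, 36, 36,
    37, 37, 38, 38, 38, 39, 39, 40, 40, 40,
    41, 41, 42, 42, 43, 43, 44, 44, 44, 45,
    45, 46, 46, 47, 47, 48, 48, 48, 49, 49,
    50, 50, 50, 51, 51, 52, 52, 52, 53, 53,
    54, 54, 55, 55, 56, 56, 56, 57, 57, 58,
    58, 59, 59, 60, 60, 60, 61, 61, 62, 62,
    63, 63, 63, 64, 64, 65, 65, 66, 66, 66,
]
CRITERION = [
    1, 2, 1, 2, 1, 1, 1, 1, 1, 2,
    1, 2, 2, 2, 1, 1, 2, 1, 2, 3,
    2, 3, 2, 1, 2, 2, 1, 1, 1, 2,
    1, 3, 3, 2, 1, 2, 2, 3, 3, 2,
    3, 1, 2, 2, 3, 2, 4, 3, 2, 2,
    3, 2, 1, 2, 3, 3, 4, 3, 2, 4,
    4, 3, 3, 2, 3, 2, 4, 2, 4, 4,
    3, 4, 2, 2, 3, 4, 3, 4, 4, 4,
]

def lowest_level_rate(cutoff):
    """For children at or below `cutoff` on the predictor, return
    (children flagged, children who later scored 1, percent scoring 1)."""
    flagged = [c for p, c in zip(PREDICTOR, CRITERION) if p <= cutoff]
    ones = flagged.count(1)
    return len(flagged), ones, 100 * ones / len(flagged)

for cutoff in (36, 42):
    n, ones, pct = lowest_level_rate(cutoff)
    print(f"cutoff {cutoff}: {ones} of {n} scored 1 ({pct:.0f} percent)")
```

Trying other cutoffs shows the trade the school faces: a higher cutoff reaches more children who need help, but a smaller share of those flagged turn out to need it.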



Questions 

12. Make a copy of the form of table 6. This time, however, change the 
individual entries into percentages. Make the percentages in each 
cell be "percentage of responses in the row." That is, the entry 
under Column 1 in Row 1 will be 70 percent, since seven out of ten 
responses in Row 1 occur in Column 1. (Your instructor has the 
completed table.) 

13. Complete the graph in figure 8 which is an alternative way of pre- 
senting the expectancy information. The top line will show the 
percentage of people at each category in the Predictor who score 2 
or less on the Criterion. The bottom line will show the proportions 
scoring at the lowest level on the Criterion. 



Exercise 13

[Blank graph to be completed. Vertical axis: percent, marked at 20, 40, 60, 80, and 100. Horizontal axis: predictor score classes -36, 37-42, 43-48, 49-54, 55-60, 61-66.]

% 2 or less = 100% 92% 4%

% 1 or less = 70% 21% 0%

Figure 8
Categories from Table 6

Decision making based on expectancy tables (which are really 
probability tables) takes some practice. If the task involved in the 
criterion measure is critically important, the humanist side of the 
decision maker says, "I don't care what this ameliorative action 
costs — any child with even one chance in 1000 of scoring a one on 
the criterion must be a part of the corrective program." But deci- 
sion makers in real settings aren't granted the luxury of being 
complete humanists. If every adult could be put through a complete 
three-day check up at the Mayo Clinic, much disease and early 
death would be prevented. The humanist side says, "Then let's do 
it." The practical decision maker sees the problems. 

Some advocate the widespread use of expectancy table informa- 
tion with individuals who have no experience in working with this 
type of report. Such use is questionable. Joe, an eighth-grade stu- 
dent, after completing a scholastic aptitude test, might be given a 
list of expectancies of success in various high school programs. 
These three entries might appear as those in table 7. The table
shows probability of a C-average (or above) to be 65 percent in the
pre-college program and 80 percent and 90 percent respectively in
the business or general programs. The reason for hesitancy in
recommending this type of reporting is based on the different (and
frequently unknown) manners in which people react to probability
statements. Linked to this problem is the possibility that the
differential reaction to probability statements is a function of
ethnic group or socioeconomic level.

Table 7
Expectancies of Success

                 PROBABILITY OF A GRADE AVERAGE
PROGRAM          AT OR ABOVE (in %)

                   A      B      C      D

Pre-College        2     15     65     95
Business           7     25     80     98
General           10     22     90     98

The expectancy statement, after all, is only as good as the mea- 
sures upon which it is based. A scholastic aptitude test measures an 
individual's skill at performing school-like tasks. The performance 
is at least partly a function of previous experience. Success in the 
later program is partially a function of successful performance of 
the school-like tasks, to be sure, but it is also a function of things 
like motivation to succeed, neatness, social skills, thoroughness,
and personal health, to name just a few. The inexperienced person 
will tend to look at the expectancy statement as inclusive of all 
of these factors and not make allowances for individual differences. 

But the possibility of harm in such a presentation technique goes 
deeper and has social significance. Look at the expectancy state- 
ments for the three high school programs once again. Do you 
suppose there could be a difference in reactions to the report based 
on socioeconomic status of the individual? Suppose Joe must decide 
between the pre-college and business programs. Do you suppose 
Joe's willingness to take risks is a function of the financial backing 
he has? Is Joe's racial group properly represented in the expectancy 
information? Do people of his race react in the same manner to a 
65 percent expectancy of a C-average or above as people from 
another race? What does 65 percent mean? It means Joe's chance 
is 65 out of 100 (or about 2 out of 3) that he will end up with 
passing grades (C or above). Would all people value that probabil- 
ity the same? Obviously not — some would be appalled at such a 
low probability of success and avoid the program. Others would
be delighted that the chances were that favorable and apply. Is it 
possible that this differential reaction is a function not only of 
individual differences, but is also a function of socioeconomic, 
racial, or cultural membership? 

You see, if the reaction to probability statements is a function of 
one of these other factors, then the use of such a reporting system 
may be systematically excluding members of certain groups from 
programs not on the basis of performance scores, but rather on the 
basis of group membership. An example should illustrate. 

Assume 20 students have exactly the same report as Joe. Ten 
were of socioeconomic level 1, the others at a lower level. All ten 
from level 1 choose the pre-college program; all 10 from the lower
level do not choose the pre-college program. The reaction to probability
statements would seem to be a function of socioeconomic
status. Because of differential reactions, a class separation occurs. 

The extent to which socioeconomic status, race, ethnicity, or 
other categorizations contribute to differential risk taking has not 
been completely documented. More research is needed. Until that 
time, however, the use of expectancy-type reports (of the type 
shown for Joe in the example) with people inexperienced in the use 
of such information, is discouraged. 

The Standard Error of Measurement 

Another technique for interpreting a validity coefficient is the 
standard error of measurement. 

Two primary bits of information are needed to describe a list of 
scores from a measure of any kind. If you have information on the 
location of the middle of the distribution (usually given by the 
mean) and information about the spread of the scores (usually 
given by the standard deviation), then an experienced observer 
can get a good picture of the distribution of scores. These two 
statements are for a list of scores for many individuals. 

Decision making based on validity statements often must be 
done for groups of people. In such situations, the mean and stan- 
dard deviation are appropriate statistics. But frequently, decisions 
also must be made about a single individual. In such cases, the 
appropriate measure of the middle of the distribution is that indi- 
vidual's score. The appropriate estimate of spread is called the 
standard error of measurement. 

But the individual only takes the test once. What is this talk of 
the spread of one individual's scores? A discussion of the standard
error of measurement is based on two assumptions, both of which 
are quite reasonable. These are: 

1. Error exists in any measure. The errors in educational mea- 
surements come from sampling bias in the domain of content 
covered, personal idiosyncrasies in the individual, and all the other
sources mentioned earlier. 

2. If the person were in a position to take tests over this content 
again and again and again, completely forgetting about the content 
of the items between testings, the assumption is that the scores 
would vary. This variation is described by the standard error of 
measurement. 

When an individual takes a test over some domain of content, a 
single score results. The two assumptions imply that the score is 
simply an indicator of the person's general performance level. The 
score indicates a performance range for this person, and not a 
performance point. The standard error of measurement is used to 
make probability statements about the person's scores such as "The 
probability is more than 99 out of 100 that Mary's real performance 
level is above average." Or, "Chances are 2 out of 3 that Peter can 
correctly answer somewhere between 60 percent and 75 percent of 
items like these." 

The computation of the standard error of measurement is accord- 
ing to this formula: 



$$\text{S.E.M.} = \text{S.D.} \sqrt{1.0 - \text{REL}}$$

where 

S.E.M. stands for standard error of measurement for each 
individual in the group; 

S.D. stands for the standard deviation of the scores for the 
tested group; 

REL is the reliability of the scores for the group tested.
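As a rough illustration, the formula translates into a one-line function. The function name and the sample values (a standard deviation of 12 and a reliability of 0.84) are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def standard_error_of_measurement(sd, rel):
    """S.E.M. = S.D. * sqrt(1.0 - REL)."""
    if not 0.0 <= rel <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - rel)

# Hypothetical values: S.D. = 12, REL = 0.84, so sqrt(0.16) = 0.4.
print(round(standard_error_of_measurement(12.0, 0.84), 2))  # 4.8
```

Note how the reliability acts on the result: as REL approaches 1.0 the term under the square root shrinks toward zero, and as REL approaches 0.0 the standard error grows toward the full standard deviation.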

Here is the procedure which is followed in obtaining the standard 
error of measurement. 

1. A test or measure of some kind is administered to a group of 
people. 

2. The standard deviation of the scores is computed (see the 
computational techniques described in chapter 8). 



3. The reliability coefficient is computed. If two testings are 
possible, a test-retest or parallel form coefficient can be used. If a 
single testing is used, then the coefficient will be based on one of the 
internal consistency techniques. 

4. The standard error of measurement can then be computed. 

To illustrate, follow the procedures through for finding the stan- 
dard error of measurement for the bowling scores in the Men's 
American Legion Bowling League. If a test-retest technique is to 
be used for reliability, the group would (a) bowl two games, 
possibly separated by a few hours or a day; (b) the standard devi- 
ation of the scores would be computed; (c) the reliability of the 
scores would be found by computing the correlation coefficient of 
the two lists of scores (since each person has two scores) ; (d) the 
standard error of measurement would be computed by the formula 
given above. Of course, if practical problems make the use of a 
test-retest coefficient difficult, split halves coefficients could be used, 
based on odd and even numbered frames. 
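Steps (b) through (d) of the test-retest illustration might be sketched as follows. Several things here are assumptions rather than the book's own method: the standard deviation is the population form (dividing by N) and is taken from the first game only, the reliability is an ordinary Pearson correlation, and the five pairs of scores are invented for the example and are not league data.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def standard_deviation(xs):
    """Population standard deviation (divides by N); an assumption here."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def correlation(xs, ys):
    """Pearson correlation of two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

def sem_from_two_testings(first, second):
    sd = standard_deviation(first)               # step (b)
    rel = correlation(first, second)             # step (c): test-retest reliability
    return sd * math.sqrt(max(0.0, 1.0 - rel))   # step (d); guards against rounding

# Hypothetical scores for five bowlers on two occasions.
game1 = [150, 160, 170, 180, 190]
game2 = [155, 158, 175, 176, 195]
print(round(sem_from_two_testings(game1, game2), 2))
```

A split-halves coefficient would slot into the same place as `correlation(first, second)`; only the source of the reliability figure changes.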



Questions 

14. Compute the standard error of measurement for each case: 

a. Standard deviation 10, reliability 0.64. 

b. Standard deviation 10, reliability 0.90. 

c. Standard deviation 10, reliability 1.00. 

d. Standard deviation 10, reliability 0.00. 

15. In terms of precision of an individual measure, what do answers 
(c) and (d) to problem 14 mean? 

16. Here are the scores for two games for ten members of the bowling 
league: 

             Game 1    Game 2

Ed             160       175
Pete           130       155
Charlie        160       165
John           170       175
Frosty         170       165
Gail           150       155
Ken            140       135
Russ           150       145
Luke           140       135
Zeb            130       145


a. Compute the standard deviation (S.D.) of the scores for Game 1. 

b. Compute the test-retest reliability. 

c. Find the standard error of measurement. 



Suppose, for the bowling scores, the standard deviation (S.D.) 
is 20, the reliability (REL) is 0.75, yielding a standard error of 
measurement of 



$$\text{S.E.M.} = (20)\sqrt{1.00 - 0.75} = (20)\sqrt{0.25} = 20(0.50) = 10.$$

Sammy, who is new to the league, rolls one game. His total for 
the game is 160. Based on this game, the members of one of the 
league's teams are considering adding Sammy to their roster. The 
team's chances at a trophy will be seriously hurt if Sammy's overall 
average is below 140. They would be happy if he would average 10 
pins above or 10 pins below 160, and they think Sammy would not 
fit in well if his average turned out to be in excess of 180. What 
kind of probability statements can be made about each of these 
concerns? 

1. The probability that Sammy will average less than 140 is 0.02, 
which translates to 1 chance in 50. 

2. The probability that Sammy will average between 150 and 
160 or between 160 and 170, is in each case 0.34 — about 1 chance 
in 3. 

3. The probability that Sammy will average higher than 180 
(and embarrass the rest of the bowlers) is 0.02 or about 1 chance 
in 50. 

The probabilities are based on the assumption of a normal distri- 
bution. The reasoning behind the assumption goes something like 
this: 

Whatever Sammy's real long term average is, sometimes he 
scores a good deal higher, sometimes a good deal lower, but usually 
he hovers pretty close to the real average. The distribution is sort 
of shaped like a bell — high in the middle and tapering off at either 
end. Thus, the probabilities associated with a similarly shaped 
mathematical model can be used to estimate chances of the various 
averages.²
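The three statements about Sammy can be reproduced with the normal curve in Python's standard library. This is a sketch that follows the text's own reasoning, centering the curve on the single observed score with the S.E.M. as its spread; the helper names are hypothetical.

```python
from statistics import NormalDist

def chance_below(observed, sem, limit):
    """Probability that the real level lies below `limit`."""
    return NormalDist(mu=observed, sigma=sem).cdf(limit)

def chance_above(observed, sem, limit):
    """Probability that the real level lies above `limit`."""
    return 1.0 - NormalDist(mu=observed, sigma=sem).cdf(limit)

def chance_between(observed, sem, low, high):
    """Probability that the real level lies between `low` and `high`."""
    curve = NormalDist(mu=observed, sigma=sem)
    return curve.cdf(high) - curve.cdf(low)

# Sammy: one observed game of 160, with an S.E.M. of 10.
print(round(chance_below(160, 10, 140), 2))          # 0.02
print(round(chance_between(160, 10, 150, 160), 2))   # 0.34
print(round(chance_between(160, 10, 160, 170), 2))   # 0.34
print(round(chance_above(160, 10, 180), 2))          # 0.02
```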

Table 8 links the most frequently asked questions to probability
estimates. This table was used to answer the probability questions
about Sammy. What are the chances that he will average below
140? The 140 is 20 points below his single observed score of 160.
The S.E.M. is 10, so the score of 140 is 2 S.E.M. below the single
observed score. From the table, the probability that the real
performance level is somewhere lower than the single observed score
and 2 S.E.M. below it is 0.02. Check the other three statements to
see if you can use the table properly.

2. The "similarly shaped mathematical model" is called the normal curve.
The normal distribution is a statistical concept. For a full elaboration of its
characteristics and use, consult nearly any elementary statistics text.


Table 8
Questions and Answers About the Probabilities of an Individual's
Real Performance Level Based on a Single Score

QUESTION                         PROBABILITY    QUESTION

What is the probability that                    What is the probability that
the real performance level is                   the real performance level is
somewhere between the single                    somewhere between the single
observed score and                              observed score and

a. 1/2 S.E.M. above it?             0.19        a. 1/2 S.E.M. below it?
b. 1 S.E.M. above it?               0.34        b. 1 S.E.M. below it?
c. 1 1/2 S.E.M. above it?           0.43        c. 1 1/2 S.E.M. below it?
d. 2 S.E.M. above it?               0.48        d. 2 S.E.M. below it?

What is the probability that                    What is the probability that
the real performance level is                   the real performance level is
somewhere higher than the                       somewhere lower than the
single observed score and                       single observed score and

a. 1/2 S.E.M. above it?             0.31        a. 1/2 S.E.M. below it?
b. 1 S.E.M. above it?               0.16        b. 1 S.E.M. below it?
c. 1 1/2 S.E.M. above it?           0.07        c. 1 1/2 S.E.M. below it?
d. 2 S.E.M. above it?               0.02        d. 2 S.E.M. below it?




But the concept of standard error of measurement was not cre- 
ated to help a teacher interpret decision making in bowling leagues. 
The concept was created to help a teacher interpret an individual's 
test score. Here are some examples of how a teacher can use the 
standard error of measurement in decision making. 

1. A final exam is administered by a high school social studies 
teacher to the 120 students in five sections. The standard deviation 
of the scores is 5.0, and the reliability is 0.64. Using the formula, 
this leads to a standard error of measurement of 



$$\text{S.E.M.} = (5.0)\sqrt{1.0 - 0.64} = (5.0)\sqrt{0.36} = (5.0)(0.60) = 3.0.$$

After looking at the distribution of scores, the teacher sets the 
lowest passing grade at 25 items correct. Tom scored 22 correct.
What is the probability that his real, typical performance level on
this domain of content was in the passing range (that is, at 25 or 
above)? Answer: Tom's score is 3 points, or one Standard Error of 
Measurement, below 25. What is the chance that his real per- 
formance level is somewhere higher than 1 S.E.M. above the single 
observed score? From the table, the answer is 0.16. In terms of 
decision making, this means that the probability is 16 chances in 
100, or about 1 chance in 6, that Tom's real performance level is 
in the passing range. Assuming she fails him, the chance is 1 in 6 
that she has erroneously done so. The chances are 16 out of 100 
that repeated testings would show Tom's average score at 25 or 
above. 

2. The Standard Error of Measurement for most IQ tests (scho- 
lastic aptitude tests) is about 6.0, based on a typical standard 
deviation of 15 or 16 and reliabilities of about 0.85. Mary Jo takes 
a single IQ test. Your report lists her IQ as 120. Besides all of the
other limitations of IQ tests mentioned in chapter 8, learn to 
interpret these scores in terms of ranges and probabilities instead 
of in terms of a single point. For example (reading from the table), 
the probability that her true IQ is between 120 and 126 (1 S.E.M. 
above her single observed score) is 0.34. The probability that it is 
between 114 and 120 (1 S.E.M. below her observed score) is the 
same— 0.34. That is, chances are 0.34 + 0.34 or 0.68 that her 
average performance over repeated testings would be between 114 
and 126. Be honest now, you react differently to a 114 IQ than to a 
126, don't you? At the 114, you sort of nod your head and say, 
"Well, that's okay— she'll make it!" With the 126 you raise your 
eyebrows and are likely to be impressed. The point is that the range
is wide — and that the chances are only 0.68, or about 2 chances in 3 
that her average performance would be in that range. For example, 
what is the chance that her average performance would be above 
132? The 132 is two S.E.M. units above 120. Reading from the 
table, the probability that her average performance is more than 2 S.E.M.'s
above her observed score is 0.02, or about 1 chance in 50 — not a 
distinct possibility, but possible nonetheless. 
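Both classroom examples can be checked numerically. The sketch below assumes, as the text does, a normal curve centered on the observed score; the helper name `sem` and the variable names are hypothetical, and Mary Jo's S.E.M. is taken as the rounded 6.0 quoted above rather than recomputed.

```python
import math
from statistics import NormalDist

def sem(sd, rel):
    """Standard error of measurement: S.D. * sqrt(1.0 - REL)."""
    return sd * math.sqrt(1.0 - rel)

# Example 1: Tom scored 22; the lowest passing grade is 25.
tom_sem = sem(5.0, 0.64)
tom_pass = 1.0 - NormalDist(22, tom_sem).cdf(25)
print(round(tom_sem, 1), round(tom_pass, 2))           # 3.0 0.16

# Example 2: Mary Jo's reported IQ is 120; S.E.M. taken as 6.0.
iq = NormalDist(120, 6.0)
within_one_sem = iq.cdf(126) - iq.cdf(114)   # between 114 and 126
above_132 = 1.0 - iq.cdf(132)                # more than 2 S.E.M. above 120
print(round(within_one_sem, 2), round(above_132, 2))   # 0.68 0.02
```

The figures agree with table 8 to two decimal places, which is as much precision as decisions of this kind can bear.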

Don't back off at this point muttering, "If the damn things are 
so inaccurate, what's the sense of doing any testing at all?" The 
lack of a precise point estimate doesn't negate the usefulness of 
measurement in education. You must learn to treat test results as 
indications of the approximate location of a student's true ability 
on the domain of content in question. Remember, you're not supposed
to categorize or pigeonhole pupils anyhow; you're supposed
to suspend judgment and look upon each individual as changing 
and changeable. No measurement, even in the physical sciences, is 
completely accurate. All measures must be used as suggestive of a 
general level of performance and a range of scores. 

Through the standard error of measurement concept, with the 
associated probability values, it is possible for you to attach num- 
bers to the decision-making process concerning an individual. In a 
certain sense, this standardizes the process and helps eliminate 
personal sorts of bias that so frequently enter into these kinds of 
decisions. But the attachment of numbers is far less important than 
the concept from which they derive: namely, that no measurement
or evaluation a teacher makes, be it quantitative or non-quantita- 
tive, is absolutely precise. You must learn to interpret a single score 
as indicative of a general performance level which spans a range of 
scores. You must suspend judgment until an accumulation of such 
evidence is gathered. If you reject educational measurements be- 
cause they admittedly contain errors and must be interpreted in 
that sense, you must, by implication, reject any evaluation of a 
student which is error laden. Save for the evaluation by a coroner 
that the student has ceased living, you will not find any tests, 
measures, evaluations, or assessments which are devoid of any 
error component. Don't say, "Well, I just won't evaluate my stu- 
dents!" because every teacher must do extensive evaluation of 
individuals. You cannot make individualized decisions without 
individualized evaluations, and individualized decisions are the 
backbone of individualized instruction. 



Questions 

17. The expectancy table following question 18 has been accumulated by a firm
which hires and trains keypunch operators. The pre-test is a series 
of tasks designed for someone who neither types nor keypunches. 
The performance measure is supervisor rating after ten months on 
the job. 

a. Assuming a rating of three is satisfactory, what is the probability
that a person who scores ten or less on the pretest will be rated 
less than satisfactory? 

b. Bertha scores 25 on the pretest. What is the chance she will be 
rated "excellent" after ten months? 

18. A certain measure to predict college success has a standard error 
of 40 points (the overall mean is 500). Joel takes the test and 
scores 580. A college refuses his request for admission because its
cutoff point is 600. What is the chance that, with repeated testing, 
Joel's average score would be 600 or more? 

FREQUENCY OF DIFFERENT RATINGS BY PRETEST SCORE

                                   RATING
PRETEST SCORE       1        2        3        4        5       TOTAL
                  (poor)            (satis)           (excel)

(low)    1-5                                                      50
         6-10                                                     50
        11-15                                                     50
        16-20                                                     50
        21-25                                                     50
(high)  26-30                                                     50

TOTAL                                                            300

[The tally marks in the cells are not legible in this copy.]




Hints, Warnings, and Suggestions: Factors to Consider 
in Measuring and Interpreting Validity 

Any time there is an evaluation of a student or a group of stu- 
dents, the validity of the evaluation must be considered. This is 
true for both quantitative and non-quantitative evaluations. Valid- 
ity is the key concept — perhaps the most important of all the 
measurement concepts. Validity is a measure of the extent to 
which the evaluation you have made is a true representation of 
the behavior you are attempting to evaluate. 

With this in mind, the importance of properly interpreting the 
validity information is obvious. Some cautions and hints have been 
included earlier in the chapter. These, plus a whole series of new 
ones, are listed below. Look upon them as warnings of things 
which could happen as validity coefficients are used, and not 
necessarily as problems which always do develop. Treat the list as 
background information. 



1. A validity statement or coefficient is meaningless in a vacuum. 
The concept has meaning only when linked to a specification of 

a. how the test was used; 

b. with whom the test was used; and 

c. the technique for obtaining independent information on the 
criterion. (Remember, in the diagram shown early in the 
chapter, that validity is determined by an independent mea- 
sure of the criterion.) 

2. The range of talent (or the range of scores) affects the size 
of the validity coefficient just as it affects a reliability coefficient. 
Generally speaking, the size of the validity coefficient will increase 
as the range of talent increases. This point also has been pre- 
viously discussed in this chapter. 

3. In trying to decide if published validity reports are appli- 
cable to a particular use you may have in mind, pay particular 
attention to these characteristics: 

a. Age of your group. Is your group approximately the same age 
as the group upon which the published validity information was 
based? If not, do you think performance on the measure in ques- 
tion is in any way a function of age? 

b. Sex of your group. Do you think the reported results are 
appropriate for males and females in an equivalent manner? Or do 
you think performance on the behavior in question is sex-related? 
If the behavior is sex-related, then separate validity statements 
should be made. 

c. Education and/or Cultural Experience. Do your students have 
approximately the same educational experience as the group upon 
which the validity statement was made? How about cultural back- 
ground? To the extent either differs, and to the extent to which 
performance on the behavior in question is a function of either, 
the published validity information must be interpreted with cau- 
tion. 

4. Take special and unique individual characteristics into con- 
sideration, such as emotional disturbances, test anxiety, and high 
or low motivation. Obviously, everyone has some level of emo- 
tional upset now and then. Most people have some test anxiety, 
and no two people have exactly the same motivational level. The 
caution here is to look out for the extreme case — the one which, 
in your impression, really stands out. 

5. Some rather mundane physical and environmental conditions 
can bring on invalidity in test results. If the testing room is too 
hot or too cold or too noisy or too crowded, or has a whole variety
of other disruptive characteristics, the students are more likely
to perform at levels which make decision making invalid. This 
category includes factors which could be controlled but are not. 
Thus, cheating, inaccurate scoring practices, and lack of satisfac- 
tory time allotments for the test all are included. Silly testing 
scheduling should be included. A "silly" time to test is one where 
the students' minds are obviously not going to be on the task at 
hand. Examples of "silly" test times are the last day or so before 
any vacation period, a Friday afternoon, the day before a major 
school event such as a sports event or social activity, the period 
immediately following violent physical exertion, or the scheduling 
of two major tests in a row. All of these — the physical conditions 
of the room, the chaotic condition of the classroom, or silly sched- 
uling practices — can lead to invalidity through the unnatural 
responses of the members of the group due to the uncontrolled 
(although able to be controlled) external factors. When a room is 
noisy, the scores are not reduced in a systematic manner. Some 
people are not bothered by noise and others are very much upset, 
and the amount of upset is not necessarily a function of per- 
formance level. The score distribution is altered from what it 
should have been under normal conditions. Some people are more 
sensitive to temperature change than others, some get very ex- 
cited about basketball games while others are unmoved — in each 
case the score distribution is altered by the extraneous factor. 
This sort of thing should not be allowed to occur. 

6. Before using a measure in decision making (which implies 
you have some confidence in the validity of the results), try to 
obtain some indication of the reliability of the instrument. A test 
which is unreliable cannot possibly be valid. An unreliable test is 
useless for any purpose. As pointed out in chapter 8, the reliability
coefficient is affected by a variety of conditions, so no single rule 
can be used to answer the question, "How high is high enough?" 
As a general rule, to be applied cautiously, reliabilities in excess of 
0.75 are desirable, although tests with reliabilities in excess of 0.50 
can still be associated with valid results. If the reliability is in the 
0.25 to 0.50 range, the usefulness should be determined on a case-by-case
basis. If the reliability is less than 0.25, the use of that
measure in the decision-making process is questionable. 

7. Check any measure carefully for obvious cases of intervening 
variables. A test to measure computational skills, for example, 
should not provide results which depend heavily on reading skills. 
Such a result would occur if the directions for the test were com- 
plicated and printed. 



Question 

19. Locate two standardized tests, one used primarily for college stu- 
dents, the other for elementary school students. Report on any 
validity studies reported for the tests. Which type of validity is 
reported? 



Summary 

Validity is a key concept. Generally speaking, a test consists of a 
sample of behaviors from a much larger domain of hoped-for be- 
haviors. The test is valid if the results based on the test are the 
same as if the results were based on all possible behaviors in the 
domain. 

The concept of validity implies two measures of the larger 
domain of hoped-for behaviors. The first is the test or measure in 
question. The second is an independently obtained measure of the 
domain. Presumably, the independently obtained second measure 
is a respected one and generally accepted to be accurate. If the 
first measure tends to give results which are in harmony with the 
second, the results are presumed to be accurate. 

Validity studies are categorized under three headings. Content 
validity requires a systematic analysis of the larger domain of 
behaviors to be covered. This is the generally accepted independent 
measure. The results of this analysis are then compared to the 
behaviors assessed by the test. The statement of content validity 
is a subjective comparison of these two. 

Criterion-related validity requires a second independent measure 
of the domain of behaviors. In this case, both measures (the test 
or measure in question and the independently obtained measure) 
are usually quantitative so the validity statement can be stated as 
a numerical coefficient. Sometimes these independent criteria are 
accepted measures which are administered at the same time (concurrent 
criteria). Sometimes, the criteria reflect actual performance 
results: results which had been predicted earlier by the first measure. 

A measure has construct validity if the results are reasonable. 
A construct is a psychologist's shorthand manner of describing a 
whole related list of behaviors under a single heading. If the results 
of the measurement covering one part of that construct 
should fit reasonably into the entire scheme of things, then that 
measure can be said to have construct validity. 




If a measure is valid, decisions can be made based on the re- 
sults. One technique for displaying results to facilitate decision 
making is the expectancy table. The procedures for constructing 
an expectancy table are not complicated, and the interpretation 
task is not complex. Cautions have been suggested in the use of 
expectancy tables with people familiar with this type of display. 
The standard error of measurement is another presentation tech- 
nique used in decision making with valid results. The use of the 
S.E.M. implies that an individual's score is suggestive of a range 
of scores at a general performance level. Computational and inter- 
pretive suggestions with the S.E.M. have been discussed. 

Finally, since validity is the key measurement concept, a series 
of hints, cautions, and suggestions were compiled in the final sec- 
tion. A test is not generally valid or invalid. A validity statement 
should include discussions of test use, people tested, range of 
scores, and criterion measure. A whole variety of testing condi- 
tions must be considered as the results of a testing are interpreted. 
Reliability information must also be obtained, since a test must 
have some degree of precision before there is any hope of validity. 



10 



Score Reporting: 

"Now That We're Here, 

Let's Write a Little Letter to 

the Folks Back Home" 



"Whajagit?!" 

Surely one of the more pervasive characteristics of a student is 
his desire to know the results of a test. One of the most annoying 
things a teacher can do is administer a test, but delay reporting 
the results for an extended time. The results of any testing program 
should be reported promptly.¹ The reporting system should 
be such that the results are understandable and useable for those 
who need or are interested in the information. 

For certain kinds of tests a simple report of "number correct" 
(sometimes called "raw score") is perfectly satisfactory. The in- 
terpretation of raw scores is occasionally difficult, however, so a 
variety of derived score scales are commonly used in reporting 
systems. These include scales based on percentages, grade equi- 
valents, and standard scores. Each of these reporting systems has 
been developed to serve certain purposes. Each has its limita- 
tions. Discussions of advantages and disadvantages of each system 
are included here, along with some cautionary notes to users. The 
chapter ends with a section on marking. 

Where will you use this information on score reporting? Two 
places: 

1. When you administer a test and have a special score reporting 
need, you should be able to convert raw score information to 
percentiles, or various standard score scales. 

2. When you use the various norms (percentile, grade equivalent, 
or standard score norms) given by test publishers and reported for 
your students, your understanding of each type of norm and consequently 
your avoidance of misinterpretations should be improved 
if you have the experience of computing each type of 
derived score on your own. 

¹ One exception to the rule is a test standardization program where data 
are needed to improve the test items but where scores are frequently not 
reported to those tested. 

This chapter is constructed following the pattern of chapter 2. 
That means you've got to get your writing hand out of the pretzel 
box and do some work as you read. Before reading any further, 
get a pencil and some paper and be ready to do the problems as 
they are presented. The problems are an integral part of the pre- 
sentation so do them as you read. Don't skip by them until you 
have finished reading the entire chapter. 



Reporting Raw Scores 

The simplest reporting system requires only that you write the 
number right (or wrong) at the top of each test. Each student 
knows how many items were in the test. Each knows how many 
times his answers corresponded to the keyed responses. Answer 
these exercises before going on. 



Questions 

1. a. Construct a four-item spelling test for eighth graders. Build the 
test such that the results will make people think this group of stu- 
dents contains exceptionally good spellers. (Assume that only the 
results will be reported — a copy of the actual test is not included.) 

b. Now suppose you want to emphasize the point that more time 
should be spent on spelling in eighth grade. Build another four-item 
test such that it appears the group contains many really atrocious 
spellers. 

c. In general, the teacher can control the difficulty level of the tests 
constructed for classroom use. Before being able to interpret his 
own score, what other information besides raw score should a student 
be given? 

2. Are raw scores an appropriate reporting system for a criterion- 
referenced test? Will the raw score have meaning to the student 




even without information on how others have performed on the same 
measure? 



With a criterion-referenced test, a raw score reporting system is 
perfectly satisfactory. As a matter of fact, the raw score should be 
reported for each of the objectives upon which the test was built. 
The student does not need information on how his classmates or 
similar students across the country have performed on the items. 
The student need only know whether or not he has met the mini- 
mum acceptance level. 

For reasons suggested in chapter 3, most tests you write or use 
will not be criterion-referenced. In most tests that are not 
criterion-referenced, the test writer is able to control the difficulty 
of a test just as you were able to do in question 1. Thus, a single score doesn't mean 
much to the student. If the items represent a sample of possible 
questions from a larger domain of content, the interpretation of the 
scores must be made on some sort of a comparison basis. With class- 
room tests, the comparisons are usually with the others in the room 
or with past students in that school. In the case of standardized 
measures, the comparison is with results obtained from carefully 
chosen norm groups. 

Of course, every combination of test and new group tested gener- 
ates a new raw score distribution. The number of items varies from 
test to test. So does the difficulty of the test. Add to these two 
conditions the varying abilities of different groups tested, and the 
noncomparability of raw score distributions is apparent. 

However, for the case of a teacher-made test written for class- 
room use, a raw score reporting system, accompanied by a distribu- 
tion of scores for comparative purposes, is enough. You could get 
fancy and add percentile ranks — especially if you want to get some 
feel for relative performance on more than one test. 



Questions 

3. Here are the scores on a test for 20 students: 



 4   14    9   17   14 
 5   15   15   15   11 
14   14   16   16   16 
10    6   10   13    6 




Copy the table below and complete the "Freq." column. Leave room 
in all of the other columns as each will be needed for later exercises. 

Table 9 
Various Reporting Systems for Scores from 20 Students 

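Score   Freq.   Cum. Freq.   P.R.    z       T       Stanine   G.E. 

17      ____    ____         ____    ____    ____    ____      ____ 
16      ____    ____         ____    ____    ____    ____      ____ 
15      ____    ____         ____    ____    ____    ____      ____ 
14      ____    ____         ____    ____    ____    ____      ____ 
13      ____    ____         ____    ____    ____    ____      ____ 
12      ____    ____         ____    ____    ____    ____      ____ 
11      ____    ____         ____    ____    ____    ____      ____ 
10      ____    ____         ____    ____    ____    ____      ____ 
 9      ____    ____         ____    ____    ____    ____      ____ 
 8      ____    ____         ____    ____    ____    ____      ____ 
 7      ____    ____         ____    ____    ____    ____      ____ 
 6      ____    ____         ____    ____    ____    ____      ____ 
 5      ____    ____         ____    ____    ____    ____      ____ 
 4      ____    ____         ____    ____    ____    ____      ____ 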



Compute the mean and standard deviation. (Refer to chapter 8 for 
computational directions.) HINT: Both come out even as whole 
numbers. 
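
If you would like to check your hand computation afterward, the two statistics can be sketched in a few lines. The function names here are assumptions chosen for illustration, and the standard deviation divides by the number of scores (not by one less), which is the convention the worked examples later in this chapter follow.

```python
from math import sqrt

# The 20 scores from question 3.
scores = [4, 14, 9, 17, 14, 5, 15, 15, 15, 11,
          14, 14, 16, 16, 16, 10, 6, 10, 13, 6]


def mean(values):
    """Arithmetic average: sum of the scores divided by how many there are."""
    return sum(values) / len(values)


def standard_deviation(values):
    """Square root of the average squared deviation from the mean (divide by N)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


print(mean(scores), standard_deviation(scores))
```

Do the problem by hand first; the printout is meant only as a check.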



Percentile Ranks and Percentile Norms 

The concept of "percent" is used extensively in our culture — which 
probably explains why a percentile rank reporting system is gen- 
erally understood by teachers and students alike. A percentile rank 
tells you what percentage of the people tested scored below a given 
score. If Pete's percentile rank is 66, Pete knows that 66 percent 
of those in the reference group scored below him on this test. 

You surely have some idea about how one finds percentages. 
Finding percentile ranks is just a little more difficult. The added 
difficulty comes from a sort of technical point. To illustrate, look at 
the table of scores you began in question 3. Two scores of 6 are 
reported. What percentile rank will be reported to those two people 
who scored 6 correct? 

Percentile rank refers to the percentage of people below a certain 
point. One person in the list scored only 4 correct, and one other 
person scored 5 correct. No other scores lower than 6 are seen in the 
list of 20 scores. Strictly interpreted, then, the percentage of scores 
below 6 is (2/20 X 100%) or 10 percent. Strictly interpreted, the 
percentile rank of a person scoring 6 correct is 10. 




The technical point enters in when the assumption is made that 
a score of 6 is an estimate of all performance levels between 5½ 
and 6½. Assuming the teacher does not give half or quarter points, 
the only scores possible are whole numbers like 4, 5, 6, and so forth. 
But performance (knowledge, ability, or whatever is being measured) 
does not come in discrete increments. Student A might be 
just a little bit better than Student B. The real differences between 
students need not be at fixed intervals. The category score 6, then, 
presumably included students at all performance levels between 
5½ and 6½. 

Of course, you don't know which of them is a "more than 5½ 
but less than 6" type of person and which is a "more than 6 but 
less than 6½" type, since all have scored exactly 6 items correct. 

Test people take the simplest way out. They simply assume that 
half of the scores represent real performance levels below the score 
6, and the other half represent scores above the point 6. Thus, to 
compute the percentile rank of a person scoring 6 correct, you first 
find the number of scores clearly below 6; that is, scores of 5 or less. 
Add to this amount one half of the scores at 6 and divide the sum 
by the total number taking the test. That is, 

$$\text{P.R. of score point 6} = \frac{\text{Number of scores below 6} + \tfrac{1}{2}\,(\text{number of scores at 6})}{\text{Number of people tested}} \times 100$$

Two scores are below 6. Two are at 6 and ½ of 2 = 1. The 
percentile rank of score point 6 is (2 + 1)/20 X 100 or 15. 
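
The same rule can be sketched as a short function. The name `percentile_rank` and its arguments are assumptions made for this illustration; the arithmetic is exactly the formula above, with half of the scores at the score point counted as below it.

```python
# The 20 scores from question 3.
scores = [4, 14, 9, 17, 14, 5, 15, 15, 15, 11,
          14, 14, 16, 16, 16, 10, 6, 10, 13, 6]


def percentile_rank(score, all_scores):
    """Percent of scores below `score`, counting half of the scores at `score`."""
    below = sum(1 for s in all_scores if s < score)
    at = sum(1 for s in all_scores if s == score)
    return 100 * (below + 0.5 * at) / len(all_scores)


print(percentile_rank(6, scores))  # 15.0, matching the hand computation
```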

An example plus questions 5 and 6 should illustrate the proce- 
dures. Just to avoid the confusion which could result if you made 
a small error in question 3, the frequency distribution is given 
below: 

Score   Freq.      Score   Freq.      Score   Freq. 

17      1          12      0           7      0 
16      3          11      1           6      2 
15      3          10      2           5      1 
14      4           9      1           4      1 
13      1           8      0           3      0 


Example 


Ed scores 13 correct. What is his percentile rank? 

a. How many scores are less than 13? Answer: 8 

b. How many scores are at score 13? Answer: 1 




c. Figuring that half of the scores at 13 are below 13, how many are 
thought of as less than 13? Answer: ½ 

d. $$\text{P.R.} = \frac{8 + \tfrac{1}{2}}{20} \times 100 = 42.5$$

Now that may seem a little silly, figuring that half of the score 
is below 13 and half is above. The reasoning is that there is a 50 
percent chance that Ed's real performance level is actually below 13 
so that one half of his score is figured into the percentile rank. 

One other technicality: Percentile ranks are usually not reported 
with a decimal point — usually you'll round them off to the nearest 
whole number. If the decimal fraction happens to be exactly one- 
half, follow this procedure for consistency's sake: Round to the 
nearest even number. Thus, 42.5 rounds to 42; 17.5 rounds to 18; 
and 55.5 rounds to 56. 
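
The round-to-even rule is easy to get wrong by habit, so here is a minimal sketch of it. The function name is an assumption; the rounding itself uses Python's standard `decimal` module with its `ROUND_HALF_EVEN` mode, which is one way (not the only way) to apply the rule.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_percentile_rank(value):
    """Round to a whole number; an exact half goes to the nearest even number."""
    whole = Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)
    return int(whole)


for pr in (42.5, 17.5, 55.5):
    print(pr, "->", round_percentile_rank(pr))  # 42, 18, and 56
```

Values that are not exact halves, such as 42.4 or 42.6, round to the nearest whole number in the usual way.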



Questions 

5. What is the percentile rank of a person who scores 12 correct? 

a. How many scores below 12? 

b. How many scores at 12? 

c. Compute percentile rank. 

6. What is the percentile rank for a person scoring 16 correct? 

a. How many scores below 16? 

b. How many scores at 16? 

c. What is ½ of the number of scores at 16? 

d. Compute the percentile rank. 

7. What is the percentile rank for a person scoring 15 correct? 



If you must compute each one of these percentile ranks indi- 
vidually, the task would be pretty tedious. However, there is a 
simple technique for quickly finding the percentile ranks of every 
student in the class. The task is especially easy if you happen 
to have a desk calculator or adding machine available — but neither 
is absolutely necessary. Here's how you do it: 

a. First, divide 100 by two times the number of students who 
took the test. For the example, 20 students took the test so the 
computation is 

$$\frac{100}{2 \times 20} = \frac{100}{40} = 2.5$$




b. For every individual at a score point, write in 2.5 twice. Ar- 
range them at each score point as shown below such that it will be 
easy to split them into halves later. 

Table 10 
Quick Percentile Rank Computing Technique 



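Score   Freq.   Computing Column 

17      1       2.5 
                - - - - - - - - - - -     P.R. for 17 = 97.5 = 98 
                2.5 
16      3       2.5  2.5  2.5 
                - - - - - - - - - - -     P.R. for 16 = 87.5 = 88 
                2.5  2.5  2.5 
15      3       2.5  2.5  2.5 
                - - - - - - - - - - -     P.R. for 15 = 72.5 = 72 
                2.5  2.5  2.5 
14      4       2.5  2.5  2.5  2.5 
                - - - - - - - - - - -     P.R. for 14 = 55 
                2.5  2.5  2.5  2.5 
13      1       2.5 
                - - - - - - - - - - -     P.R. for 13 = 42.5 = 42 
                2.5 
12      0 
                - - - - - - - - - - -     P.R. for 12 = 40 

11      1       2.5 
                - - - - - - - - - - -     P.R. for 11 = 37.5 = 38 
                2.5 
10      2       2.5  2.5 
                - - - - - - - - - - -     P.R. for 10 = 30 
                2.5  2.5 
 9      1       2.5 
                - - - - - - - - - - -     P.R. for 9 = 22.5 = 22 
                2.5 
 8      0 
                - - - - - - - - - - -     P.R. for 8 = 20 

 7      0 
                - - - - - - - - - - -     P.R. for 7 = 20 

 6      2       2.5  2.5 
                - - - - - - - - - - -     P.R. for 6 = 15 
                2.5  2.5 
 5      1       2.5 
                - - - - - - - - - - -     P.R. for 5 = 7.5 = 8 
                2.5 
 4      1       2.5 
                - - - - - - - - - - -     P.R. for 4 = 2.5 
                2.5 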

Note that 2.5 appears twice in the "Computing Column" for each 
entry in the "Freq." column. The 2.5s are split so that one appears 
below and one above the dashed line for each score point. 

Computation of percentile ranks for all score points is a simple 
matter. You accumulate all the 2.5s which appear below the dashed 




line for the score point. One 2.5 appears below the dashed line for 
score point 4, so the percentile rank is 2.5 (which rounds off to 2). 
To find the percentile rank for a score of 5, simply keep adding the 
new 2.5s which appear up to the dashed line at 5. You can obtain 
the percentile ranks for every score point written in just a few 
minutes once you are familiar with this technique.² 
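
For readers with a computer in place of an adding machine, the accumulating procedure can be sketched as follows. The function name and the dictionary layout are assumptions made for illustration; the logic mirrors the table, adding one share per student below the dashed line and carrying the other share up to the next score point.

```python
def quick_percentile_ranks(freq):
    """freq maps each score point to its frequency; returns unrounded P.R.s."""
    n = sum(freq.values())
    unit = 100 / (2 * n)                 # 2.5 when 20 students are tested
    ranks = {}
    running = 0.0                        # units accumulated below the dashed line
    for score in sorted(freq):
        running += unit * freq[score]    # the half of this score point below the line
        ranks[score] = running
        running += unit * freq[score]    # the other half counts toward higher scores
    return ranks


freq = {17: 1, 16: 3, 15: 3, 14: 4, 13: 1, 12: 0, 11: 1,
        10: 2, 9: 1, 8: 0, 7: 0, 6: 2, 5: 1, 4: 1}
for score, pr in sorted(quick_percentile_ranks(freq).items(), reverse=True):
    print(score, pr)
```

The values printed are the unrounded percentile ranks; round them to whole numbers with the round-to-even rule described earlier.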



Question 

8. Copy the frequency distribution shown below in the form of Table 10. 
Using the quick percentile rank computing technique, find the per- 
centile rank of every score point. 

Score   Frequency      Score   Frequency 

100     4              86      1 
 99     1              85      2 
 98     2              84      9 
 97     5              83      4 
 96     3              80      3 
 94     7 
 92     4              TOTAL   50 
 90     5 



At this point you should be able to translate raw score test re- 
sults to percentile ranks. That fulfills one part of the chapter ob- 
jectives. The second part deals with the use of percentiles in tables 
of norms for standardized tests. 

Although the data for question 3 are fictitious, if this had been 
a standardized test, the publisher's manual might have included 
percentile norm information as shown in table 11. 

Table 11 
Percentile Norms for a Random National Sample of 3rd, 4th, 5th 
and 6th Grade Students 

Percentile          Score by Grade 
Rank            3      4      5      6 

55             10     12     14     16 
54              9     10     13     15 
53              8      9     11     13 
52              7      7     10     11 
51              6      6      9     10 
50              5      5      8      9 

Assume the test from question 3 was administered to third-grade 
children. A student who scored 10 correct has a national percentile 
rank of 55. (See the entry under grade 3 and opposite the 55 under 
the percentile rank column.) This means the child scored higher 
than 55 percent of the third-grade students in the national standardization 
sample. Reading the grade 4 column, the score of 10 
translates to a percentile rank of 54. The score of 10 exceeded that 
of 54 percent of the fourth graders in the national standardization 
sample. 

² One other computational comment: Sometimes the result of dividing 
100 by two times the number tested comes out to a nasty looking decimal, 
instead of the nice, even one of the example. In most cases where class size 
is about 30, it is perfectly satisfactory to carry two decimal places. If you 
have 21 students, 100/42 = 2.38, rounded to the second decimal place. 

Percentile ranks and percentile norms are widely understood. 
The use of these systems is encouraged for reports to students and 
parents. However, three important cautions must be noted in the 
interpretation of percentile norms: 

1. The reference group upon which the norms are based must be 
reported very explicitly. 

2. The comparison to the reference group must be a legitimate 
one. 

3. Usually, percentile rank differences at the extremes (high and 
low ends) have more meaning than differences at the middle of the 
distribution. 

The rationale for the explicit reporting of reference group infor- 
mation should be obvious. A score of 10 on the little test earns a 
percentile rank of 55 when compared to a random national sample 
of third graders, but only a rank of 51 when the comparison group 
is sixth graders. If the score of 10 had been compared to a random 
sample of children with learning disabilities, the percentile rank 
might have been in the 90s. The comparison means nothing if the 
group upon which the norms were based is not clearly specified. 
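
To see how much the answer depends on the reference group, the entries of Table 11 can be put into a small lookup. This is only a sketch: the names `NORMS` and `percentile_rank_for` are assumptions, and the table holds just the six rows shown in this chapter, not a full set of published norms.

```python
# Table 11: for each grade, percentile rank -> raw score needed to earn it.
NORMS = {
    3: {55: 10, 54: 9, 53: 8, 52: 7, 51: 6, 50: 5},
    4: {55: 12, 54: 10, 53: 9, 52: 7, 51: 6, 50: 5},
    5: {55: 14, 54: 13, 53: 11, 52: 10, 51: 9, 50: 8},
    6: {55: 16, 54: 15, 53: 13, 52: 11, 51: 10, 50: 9},
}


def percentile_rank_for(score, grade):
    """Return the percentile rank listed for `score` in `grade`, or None."""
    for rank, raw in NORMS[grade].items():
        if raw == score:
            return rank
    return None   # score falls outside the portion of the table shown


for grade in (3, 4, 6):
    print(grade, percentile_rank_for(10, grade))   # 55, 54, and 51
```

The same raw score of 10 yields a different percentile rank for each grade, which is why the reference group must always be named.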

You must go one step further and not only know the details of 
the reference, however. The characteristics of the student being 
compared must conform to the characteristics of the reference 




group. If the student is atypical of the reference group, the com- 
parison is unfair and meaningless. 

What are some cases of nonconformity between student and 
reference group? The best way to look at the question is to think 
of the characteristics of the typical reference group. The typical 
group consists of some kind of random sample from the entire 
population of students at a given age level. The racial, socioeco- 
nomic, and ethnic characteristics of the sample will roughly reflect 
those of the larger population. About half will be males. The in- 
structional program through which the students in the sample pass 
will roughly reflect the typical instructional program in the country. 

Meaningless comparisons will occur when the student's program 
or background differs sharply from the typical one. If the student 
barely speaks English; if the student has had private tutors all his 
life; if the school has delayed the introduction of arithmetic at least 
two years beyond that of a typical school; if the school has tripled 
the instructional time in the tested area — each of these is an exam- 
ple of a case where the characteristics of the student differ sharply 
enough with the characteristics of a typical sample to cause a com- 
parison to have limited meaning. 

Percentile norms are easy to construct. Local norms can be 
accumulated, especially in the atypical types of situations like 
those described above. A single test could have many different 
tables of norms associated with it, linking performance to a whole 
variety of reference groups. 

The third cautionary comment is a technical point and is best 
illustrated by an example. Usually, scores on any sort of measure 
form a somewhat bell-shaped distribution. A pervasive character- 
istic of most groups, when confronted with a task, is that a few 
will be exceptionally good at the task, a few exceptionally poor, and 
the rest will bunch up around the average. The scores on most 
measures are likely to assume a distribution something like this 
hypothetical one: 

                        Percentile 
Score     Frequency     Rank 

10         1            99 
 9         2            96 
 8        12            82 
 7        20            50 
 6        12            18 
 5         2             4 
 4         1             1 




The mean of the distribution is 7, and most of the scores are 
around the middle — a typical sort of result for most measures. Look 
at these three pairs of students: 

Tom:     Score 6,  P.R. 18 
Dick:    Score 7,  P.R. 50 
         Score difference is one point. 
         P.R. difference is 32 percentage points. 

Harry:   Score 9,  P.R. 96 
Hubert:  Score 10, P.R. 99 
         Score difference is one point. 
         P.R. difference is 3 percentage points. 

Louis:   Score 8,  P.R. 82 
Jane:    Score 10, P.R. 99 
         Score difference twice as large (2 points) as between Tom 
         and Dick, but P.R. difference is 17 percentage points. 



At the middle of the distribution, small changes in number cor- 
rect cause large differences in percentile ranks. The uninitiated 
will have a tendency to overinterpret these differences. After all, 
the difference between Tom's P.R. of 18 and Dick's P.R. of 50 
appears drastic — yet the score difference was a single point. At the 
extreme, the same one point change resulted in a P.R. change of 3 
percentage points. Differences at the extremes mean more than 
equal-sized differences in the middle. For example: 

Tom's percentile rank in reading is 99. In arithmetic it is 89. 
Clyde's P.R.s in reading and arithmetic are 55 and 45, respectively. 
The difference for the two is the same (10 percentage points), but 
the real difference is more meaningful in Tom's case. 

Clem's P.R.s in reading, arithmetic, and spelling are 90, 70, and 
50, respectively. The difference between reading and arithmetic and 
the difference between arithmetic and spelling are the same (20 
percentage points), but the 20 percentage points at the extreme 
(between arithmetic and reading) have more real meaning than do 
the 20 points at the middle (between arithmetic and spelling). 
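
One way to put numbers on this point is to translate each percentile rank into standard deviation units. The sketch below assumes the scores follow a normal curve exactly, which the text claims only roughly ("somewhat bell-shaped"), and the function names are assumptions. It uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist


def pr_to_sd_units(pr):
    """Standard deviation units from the mean for a percentile rank (normal curve assumed)."""
    return NormalDist().inv_cdf(pr / 100)


def gap(pr_high, pr_low):
    """Distance between two percentile ranks, in standard deviation units."""
    return pr_to_sd_units(pr_high) - pr_to_sd_units(pr_low)


print("Tom,   99 vs 89:", round(gap(99, 89), 2))
print("Clyde, 55 vs 45:", round(gap(55, 45), 2))
print("Clem,  90 vs 70:", round(gap(90, 70), 2))
print("Clem,  70 vs 50:", round(gap(70, 50), 2))
```

Under that assumption, Tom's 10 percentage points cover several times the distance of Clyde's, and Clem's upper 20 points cover more than his lower 20.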

Because of the inequality of the units in a percentile rank scale, 
all arithmetic operations (adding, subtracting, multiplying, and 




dividing) are inappropriate. Not only should you avoid working 
with the differences between pairs of percentile ranks, but you must 
also avoid computing average percentile rank. If your students' 
scores are reported in your grade book as percentile ranks, you 
cannot at the end of a marking period determine average percentile 
rank. This difficulty has led test publishers to create another 
reporting system called standard scores, which are described next. 



Standard Scores and Standard Score Norms 

A variety of standard score systems can be used, but all are based 
on the same principle. A standard score reports to the student the 
number of standard deviations his score was above or below the 
mean. To show why a standard score system is needed for com- 
parison, consider Ed's scores on these two tests: 



                           Test A     Test B 

Average for the Group        40         80 
Standard Deviation           10         20 
Ed's Score                   50         85 

Ed's scores were above the mean for both tests, and his raw score 
was certainly higher on Test B, but relative comparisons are diffi- 
cult. One transformation is a basic raw-score form. In this system, 
a score is simply translated to standard deviation units. For Test A, 
Ed is 10 points above the mean. Since the standard deviation is 10 
points, Ed is 1 standard deviation above the mean for Test A. In 
Test B, he is 5 points, or ¼ standard deviation above the mean. 
Actually, then, his performance was better in Test A, assuming the 
reference group was the same in both cases. Since the numbers in 
the example are convenient, the calculations can almost be done by 
intuition. Find the difference between Ed's score and the mean, 
then divide by the standard deviation. Some people like computa- 
tional formulas, so the intuitive formula can be changed to the 
formal one: 

$$z = \frac{\text{Score} - \text{Mean}}{\text{Standard Deviation}}$$
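
Applied to Ed's two tests, the formula can be sketched as a one-line function. The name `z_score` and its parameter names are assumptions for illustration.

```python
def z_score(score, mean, standard_deviation):
    """Number of standard deviations a score lies above (+) or below (-) the mean."""
    return (score - mean) / standard_deviation


test_a = z_score(50, 40, 10)   # 1.0: one standard deviation above the mean
test_b = z_score(85, 80, 20)   # 0.25: one quarter of a standard deviation above
print(test_a, test_b)
```

Although Ed's raw score is higher on Test B, his z-score is higher on Test A.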



The symbol z is commonly used for the system of standard scores 
where a student's score is reported as the number of standard devia- 




tion units above or below the mean. Just to make sure you have 
mastered the procedures, consider these examples: 

1. The mean of a distribution is 17 and the standard deviation 
is 5.0. Translate these scores to z-scores: 10, 22, 30, and 7. 

Score of 10 becomes: 
$$z = \frac{\text{Score} - \text{Mean}}{\text{Standard Deviation}} = \frac{10 - 17}{5} = \frac{-7}{5} = -1.4$$

Score of 22 becomes: 
$$z = \frac{22 - 17}{5} = \frac{5}{5} = +1.0$$

Score of 30 becomes: 
$$z = \frac{30 - 17}{5} = \frac{13}{5} = +2.6$$

Score of 7 becomes: 
$$z = \frac{7 - 17}{5} = \frac{-10}{5} = -2.0$$

2. Convert these test scores to z-scores: 23, 17, 19, 21, and 20. 

a. The mean of the scores is 

$$\frac{23 + 17 + 19 + 21 + 20}{5} = \frac{100}{5} = 20$$

b. The standard deviation is 

Score     Mean     Dev     Squared Dev 

23        20       +3      9 
17        20       -3      9 
19        20       -1      1 
21        20       +1      1 
20        20        0      0 

          Sum of Squared Deviations   20 

$$\text{Standard Deviation} = \sqrt{\frac{20}{5}} = \sqrt{4} = 2$$

c. The z-scores are: 

$$\text{For 23:}\quad z = \frac{23 - 20}{2} = \frac{3}{2} = +1.5$$

$$\text{For 17:}\quad z = \frac{17 - 20}{2} = \frac{-3}{2} = -1.5$$

$$\text{For 19:}\quad z = \frac{19 - 20}{2} = \frac{-1}{2} = -0.5$$

$$\text{For 21:}\quad z = \frac{21 - 20}{2} = \frac{+1}{2} = +0.5$$

$$\text{For 20:}\quad z = \frac{20 - 20}{2} = \frac{0}{2} = 0.0$$

3. What is the mean z-score? 

$$\text{Mean } z = \frac{+1.5 - 1.5 - 0.5 + 0.5 + 0.0}{5} = \frac{0}{5} = 0$$

4. What is the standard deviation of the z-scores? 

                     z-Score              Squared 
Score     z-Score    Mean       Dev       Dev 

23        +1.5       0          +1.5      2.25 
17        -1.5       0          -1.5      2.25 
19        -0.5       0          -0.5       .25 
21        +0.5       0          +0.5       .25 
20         0.0       0           0.0       .00 

          Sum of Squared Deviations: 5.0 

$$\text{Standard Deviation of } z\text{-Scores} = \sqrt{\frac{5.0}{5}} = \sqrt{1} = 1$$
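
Examples 2 through 4 can be checked in one pass. This is a sketch under stated assumptions: the variable names are illustrative, and `fmean` and `pstdev` come from Python's standard `statistics` module, where `pstdev` divides by the number of scores just as the hand computation does.

```python
from statistics import fmean, pstdev

scores = [23, 17, 19, 21, 20]

mean = fmean(scores)      # 20.0
sd = pstdev(scores)       # 2.0, dividing the sum of squared deviations by N

# Translate every score to standard deviation units.
z_scores = [(s - mean) / sd for s in scores]

print(z_scores)                             # [1.5, -1.5, -0.5, 0.5, 0.0]
print(fmean(z_scores), pstdev(z_scores))    # 0.0 and 1.0
```

Whatever list of scores you start with, the z-scores computed this way have a mean of 0 and a standard deviation of 1.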



Questions 

9. Complete the z-score column for the data given in question 3. 

10. Compute the mean for the list of 20 z-scores computed in question 
9. (The answer should be zero.) 

11. Compute the standard deviation of the list of 20 z-scores. (This 
may be a little tedious since fractions are involved, but persevere! 
The answer is 1.) 



The z-score system avoids the problem of unequal units encountered 
with percentile ranks. Arithmetic operations can be 
carried out on scores which have been translated to z-scores. Both 




the example and exercises above point to two minor difficulties with 
z-scores: (1) they use negative numbers and (2) decimal fractions 
are involved. Since many people are uncomfortable working with 
negative numbers and/or decimals, alternative standard score sys- 
tems have been initiated. These systems still report scores as a 
function of the number of standard deviations of each person from 
the mean. 

Rather than report the number of standard deviations above or 
below zero, these systems set the mean at some point higher than 
zero (say at 50) so that scores below the mean do not become 
negative numbers. Rather than use 1.0 as the standard deviation 
(like the z-score scales do as computed in question 11), they use a 
much larger number to stand for one standard deviation unit. A 
portion of a standard deviation unit can still be a whole number. To 
establish a system with a conveniently chosen mean and standard 
deviation, you must first find the list of z-scores. Then, follow this 
formula for each score: 

New Score = (Desired Standard Deviation) (z-score) 
+ Desired Mean. 

If, for example, you desire to have the new standard deviation be 10 
and the new mean be 50, the formula becomes 

New Score = 10z + 50. 
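
The conversion formula can be sketched as a function with the desired mean and standard deviation as parameters. The name `rescale` and the default values of 50 and 10 are assumptions chosen to match the case just described.

```python
def rescale(z, new_mean=50, new_sd=10):
    """New Score = (Desired Standard Deviation)(z-score) + Desired Mean."""
    return new_sd * z + new_mean


print(rescale(1.5))             # 65.0
print(rescale(-0.5))            # 45.0
print(rescale(1.0, 100, 20))    # 120.0 on a scale with mean 100, S.D. 20
```

A z-score of zero always lands on the new mean, and each whole unit of z moves the new score by one new standard deviation.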



Examples 

5. The mean of the distribution is 17 and the standard deviation 
is 5. Convert these scores such that the new mean is 100 and the 
new standard deviation is 20: 10, 22, 30, and 7. (The z-scores were 
computed in example 1 and are -1.4, 1.0, 2.6, and -2.0, respectively.) 

For the 10:   New Score = 20z + 100 
                        = (20) (-1.4) + 100 
                        = -28 + 100 
                        = 72 

For the 22:   New Score = (20) (1.0) + 100 = 120 

For the 30:   New Score = (20) (2.6) + 100 = 152 

For the 7:    New Score = (20) (-2.0) + 100 = 60 




6. Convert the scores from example 2 into new scores with a 
mean equal to 50 and standard deviation of 10. 

Score     z-score     Conversion 

23        +1.5        New Score = (10) (1.5) + 50 = 65 
17        -1.5        New Score = (10) (-1.5) + 50 = 35 
19        -0.5        New Score = (10) (-0.5) + 50 = 45 
21        +0.5        New Score = (10) (+0.5) + 50 = 55 
20         0.0        New Score = (10) (0.0) + 50 = 50 



7. What is the mean of the five new scores? 

$$\text{New Mean} = \frac{65 + 35 + 45 + 55 + 50}{5} = \frac{250}{5} = 50$$

So the new distribution does have a mean of 50 as desired. 

8. What is the standard deviation of the new distribution? 

          New       New                     Dev. 
Score     Score     Mean     Deviation      Squared 

23        65        50       +15            225 
17        35        50       -15            225 
19        45        50        -5             25 
21        55        50        +5             25 
20        50        50         0              0 

          Sum of Squared Deviations         500 

$$\text{New Standard Deviation} = \sqrt{\frac{500}{5}} = \sqrt{100} = 10$$



What a surprise! 



Questions 

12. Complete the column headed "T" for question 3 such that the new 
mean is 50 and the new standard deviation is 10. 

13. Find the mean of the new scores computed in question 12. 

14. Find the standard deviation of the new scores computed in ques- 
tion 12. 




These transformations, either to a z-score or to some other more 
convenient new-score scale, have practical applications in the class- 
room. If all test scores are translated to the same scale (say with 
mean 50 and standard deviation of 10), the comparison, interpre- 
tation, and averaging of scores are all facilitated. In fact, older 
students, once they have become initiated to the use of a standard 
score scale, will find their test results easier to interpret if reported 
in this manner. 

This ends the discussion of standard score types that the teacher 
can actually compute and find useful for the classroom. You should, 
by this time, be able to take a list of scores, find the mean and 
standard deviation, and change the scores to have any chosen new 
mean and new standard deviation. 

Test publishers use standard scores extensively. Rarely are 
scores from standardized tests reported as raw scores. College en- 
trance tests, such as those published by the College Entrance Ex- 
amination Board, have a transformed score scale with a mean of 
500 and a standard deviation of 100. The same mean and standard 
deviation have been chosen by the publishers of the Graduate 
Record Exam. A score of 600 on either test does not mean the person
answered 600 items correctly (the test contains far fewer than 600
items). The 600 means the person was one standard deviation 
above the mean score. 

The scales used by test publishers add one additional wrinkle to 
the computational procedures you have just learned. Not only do 
these transformed scales have a convenient new mean and stand- 
ard deviation, but the scales are adjusted such that a person's 
percentile rank is linked to his standard score. For example, the 
lowest 1 percent of the scores shall be assigned a standard score 
of —2.33 standard deviation units. The next 1 percent of scores 
shall be assigned a standard score of —2.05 standard deviation 
units. The relationship between percents and standard deviation 
is based on the use of a normal curve.³ The procedure is illustrated
below: 



Example 

9. A test is being normed on a sample of 1000 fifth-grade pupils. 
The test contained 75 items and the scores for the 1000 pupils 

3. The normal curve and normal distribution are statistical concepts and a 
discussion of them is not included in the body of this text. For those inter- 
ested, however, an abbreviated table of normal distribution probabilities is 
given in the appendix, along with some directions for using the distribution. 




ranged from 12 up to 73. The scores are going to be transformed 
such that the derived scores have a normal distribution with mean 
100 and standard deviation of 20. Here is the bottom of the 
frequency distribution: 



                                  Standard          New Score
                                  Deviation         (Mean = 100;
Score   Freq.   Group             Units             S.D. = 20)

 20      10     4th lowest 1%       -1.75           65.00
                (P.R. = 4)

 19       7     3rd lowest 1%       -1.88           62.40 or 62
 18       3     (P.R. = 3)

 17       6     2nd lowest 1%       -2.05           59.00
 16       1     (P.R. = 2)
 15       3

 14       2     Lowest 1%           -2.33           53.40 or 54
 13       5     (P.R. = 1)
 12       3



The bottom ten scores make up the lowest 1 percent of the 
total of 1000. Using a table of probabilities for a normal distribu- 
tion, it can be shown that a percentile rank of 1 (representing the 
lowest 1 percent of the scores) is associated with —2.33 standard 
deviation units — that is, 2.33 standard deviation units below 
zero. If the new scale is to have a standard deviation of 20, 2.33 
standard deviations change to (20 × 2.33) or 46.6 points, but this
is supposed to be 46.6 points below the mean. The new mean is to 
be 100, so 46.6 points below 100 is 53.4. As a result, any student 
who gets 12, 13, or 14 correct in the test will have a score of 54 
reported for him. The test publisher will not report the raw score 
but will report only the standard score. 

The next lowest 1 percent of the scores can be found at score 
points 15, 16, and 17. Again, it can be shown from a table of 
probabilities for a normal distribution that —2.05 standard de- 
viations has a percentile rank of 2 (indicating a score which ex- 
ceeds 2 percent of the others in a normal distribution). This 
changes to a score of 59 on the scale of mean 100 and standard 
deviation of 20. 
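
The link between percentile rank and standard score described above can be sketched in Python. This is a hedged illustration, not the publisher's actual procedure: the function name `normalized_score` is an assumption, and the standard library's `NormalDist` stands in for the printed table of normal distribution probabilities.

```python
from statistics import NormalDist

def normalized_score(percentile_rank, new_mean=100, new_sd=20):
    """Standard score tied to a percentile rank through the normal curve."""
    # Standard deviation units for this percentile rank, rounded to two
    # places as a printed normal-curve table would show them.
    z = round(NormalDist().inv_cdf(percentile_rank / 100), 2)
    # Same conversion as before: New Score = (new S.D.)(z) + new mean
    return z, round(new_sd * z + new_mean, 1)

for pr in (1, 2):
    print(pr, normalized_score(pr))
```

For percentile ranks 1 and 2 the sketch reproduces the -2.33 and -2.05 standard deviation units, and the 53.4 and 59.0 new scores, worked out in the text.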




Question 

15. Verify that the new scores for the third and fourth percentile 
ranks are 62.40 and 65 as shown. 



A popular standard score scale based on nine points is the 
stanine scale. The name is an abbreviation of standard nine-point
scale. The chart below indicates the proportion of the standardiza- 
tion sample who are assigned to each stanine. 

STANINE                   1    2    3    4    5    6    7    8    9

PERCENTAGE OF
SAMPLE ASSIGNED          4%   7%  12%  17%  20%  17%  12%   7%   4%

Any student in stanine 1 has scored as low as the lowest 4 per- 
cent of the scores in the standardization sample. For the partial 
frequency distribution in example 9, all scores from 12 to 20 would 
be assigned a stanine score of 1, since these scores make up the 
lowest 4 percent of the sample of 1000. 
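
A stanine can be looked up from a percentile rank with a short Python sketch. It assumes the conventional 4-7-12-17-20-17-12-7-4 percent split, and it assumes that a percentile rank falling exactly on a boundary belongs to the lower stanine (so that P.R. = 4 is still stanine 1, as in example 9). The names `UPPER_BOUNDS` and `stanine` are illustrative only.

```python
from bisect import bisect_left

# Cumulative percentage at the top of stanines 1 through 8, from the
# conventional 4-7-12-17-20-17-12-7-4 percent split.
UPPER_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(percentile_rank):
    """Stanine (1 to 9) for a percentile rank between 1 and 99."""
    # Count how many stanine ceilings lie strictly below this percentile rank.
    return bisect_left(UPPER_BOUNDS, percentile_rank) + 1

print([stanine(pr) for pr in (1, 4, 5, 50, 96, 97)])
```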



Question 

16. Compute the stanines for the individuals in question 3. 



Grade Equivalent Norms 

Grade equivalents were created by test publishers as a system of 
derived scores to be used with elementary school pupils. The grade 
to which a pupil is assigned is a commonly understood reference 
point. Rather than report, "Tom has a percentile rank of 68," or 
"Tom is in the sixth stanine," the derived score report might in- 
dicate that "Tom's Grade Equivalent is 4." The person receiving 
the information may not know much about standard scores or 
percentile ranks, but is likely to have a fair perception of what a 
fourth grader is like. Now if Tom happens to be in fourth grade, 
the statement is directly understood by the receiver (such as 
Tom's parents) — Tom is doing about as well as he is supposed to 




be doing. When he moves on to fifth grade, it will be expected 
that his grade equivalent will increase one year to be 5. 

Although you might actually use percentile ranks or standard 
scores in the classroom, grade equivalent scales are unique to 
standardized tests. Grade equivalent scores are frequently misused 
by people in education. To help you understand the concept of 
grade equivalents better, it seems wise to show you how the test 
publisher goes about establishing this type of derived score. This 
background information should help you avoid the error of mis- 
interpreting grade equivalent scores. 

Go back to the data first presented in question 3 of this chap- 
ter. The frequency distribution of these scores is shown in table 
10. You will note that the scores for this sample of 20 students 
range from 4 to 17. Assume for now that all of the students are in 
the third grade, and that the test was administered in November 
of that third-grade year. A test publisher does something quite 
different in establishing grade equivalent norms — different from 
the case of percentile rank or standard score norms. 

For grade equivalent norms, the same test is administered to
samples of students in a series of grades. For percentile rank or
standard score norms, by contrast, the test used for third graders
will be normed only on third-grade students, the test for fourth
graders on fourth-grade students, and so forth. The distinction will
be important to remember later, for you must keep in mind with grade
equivalents that those sixth-grade students used in establishing
these grade equivalent scales are being tested with a measure
containing only third-grade level concepts.

Suppose the test has been administered at grades 2, 3, 4, 5, and 
6 to big samples of students from across the country. The testing 
was done in January of the year. The average results for the five 
grade levels are as follows: 

GRADE                2     3     4     5     6

AVERAGE SCORE
IN SAMPLE            5    10    20    25    30

To determine grade equivalents, this information is graphed in a 
manner somewhat like that shown in figure 9. The average score for 
the second graders was 5, but this average is graphed above the 2.5 
grade level and not the 2.0 level. Remember that the testing was 
done in January, which is the fifth month of the usual school year. 



[Figure 9. Hypothetical Grade Equivalent Computing Chart Based
on a Sample of Students from Grades 2 Through 6. Horizontal axis:
Grade Level at Time of Testing (1 through 7). Vertical axis: score
(0 through 30). Readings are marked for Score = 17 and Score = 10.]



With grade equivalent scores, the whole number refers to the grade, 
and the decimal refers to the month of the school year. A 2.0 stands 
for a second grader in September; a 4.5 stands for a fourth grader 
in January; a 5.8 stands for a fifth grader in May. The summer 
months are designated by a 0.9 attached to the grade level. 

The graph is constructed by connecting the mean scores obtained 
at grades 2 through 6. Now you are ready to find the grade 
equivalents for the twenty scores first presented in question 3. Look 
at the highest scores, which are 17s. As shown on the chart, draw 
a line directly out from 17 over to the graph. When the line intersects
the graph, draw another line down to the grade scale. Read
the grade equivalent directly as 4.2. For a score of 17, it is 4.2 and 
for a score of 10 the grade equivalent is 3.5. 
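
Reading the chart amounts to straight-line interpolation between the plotted averages, which can be sketched in Python as follows. The sketch assumes the five January averages given above, plotted at grade levels 2.5 through 6.5; the names `NORM_POINTS` and `grade_equivalent` are illustrative, and scores outside the plotted range are deliberately refused rather than guessed at.

```python
# (grade level at time of testing, average score) for the January samples
NORM_POINTS = [(2.5, 5), (3.5, 10), (4.5, 20), (5.5, 25), (6.5, 30)]

def grade_equivalent(score):
    """Read a grade equivalent off the straight-line segments of the chart."""
    for (g_lo, s_lo), (g_hi, s_hi) in zip(NORM_POINTS, NORM_POINTS[1:]):
        if s_lo <= score <= s_hi:
            # How far along this segment the score falls, from 0 to 1
            fraction = (score - s_lo) / (s_hi - s_lo)
            return round(g_lo + fraction * (g_hi - g_lo), 1)
    raise ValueError("score lies outside the range covered by the norming samples")

print(grade_equivalent(17), grade_equivalent(10))
```

The two calls reproduce the readings of 4.2 and 3.5 taken from the chart.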




Questions 

17. Complete the Grade Equivalent column for question 3. 

18. List at least two suggestions for techniques to be used to find the 
grade equivalents for the person with 4 correct. 



All of the derived score reporting systems (percentile ranks, the 
various standard score scales, and grade equivalents) were created 
by test publishers to facilitate score interpretation. None has been 
devised to confuse the public or cause serious interpretation prob- 
lems. Grade equivalents, though often maligned, do have two 
advantages as a reporting system: 

1. They provide a reference system (i.e. grade in school) which 
is generally understood by everyone. 

2. They directly indicate growth for an individual. If a student's 
grade equivalents in fourth and fifth grade are 4.5 and 5.5, the 
scores directly indicate that the student has grown 1.0 years be- 
tween fourth and fifth grade. The percentile ranks and standard 
scores for such a student would remain constant. The notion of 
"growth" is primarily associated with the skill areas, which is why 
grade equivalent scales are used only during the elementary school 
years. 

Grade equivalents, like hand guns and automobiles, are not 
inherently bad. Only when an individual uses any of them improp- 
erly does criticism result. Improper use of grade equivalent scores 
can cause damage to a child or group of children. Read the follow- 
ing discussion carefully so that you can avoid misusing this scoring 
system. 

1. Linda is in third grade. She has a grade equivalent of 5.4 in 
reading. Translated, that means she reads as well as the average 
fifth grader after four months. Should this third grader be immedi- 
ately raised to fifth grade? NO! Remember how grade equivalent 
scales are derived — second through sixth graders are tested with a 
third grade test. The test had no fifth grade concepts in it. The 5.4 
for Linda means she is very good at third-grade concepts — as good, 
in fact, as the average fifth grader is at third-grade concepts. The 
score says nothing about her ability to succeed in a fifth-grade 
environment. 




2. Linda's grade equivalent in arithmetic computations is 4.4. 
Compared to her reading score of 5.4, that's not very good. Rela- 
tively speaking, she must be better at reading than at arithmetic 
computation. Right? WRONG! This is a seldom understood
idiosyncrasy of grade equivalent scales which causes a good deal of
misinterpretation. Without getting too technical, reason through 
the following steps: 

a. Assume a very bright third-grade student — say one who 
scores a percentile rank of 95 in all areas. In other words, the stu- 
dent is equally good at reading, arithmetic, science, social studies, 
and word skill tests. 

b. Reading is something one learns in the early grades. After first 
learning to read, the student really only needs to practice the skill 
in the following years. Between third and sixth grade no new and 
specific reading skills are presented to the student. The notion of a 
very good third grader reading as well as the average sixth grader is 
not hard to comprehend. A grade equivalent of 5.4 or even 6.4 for a 
third grader is not unusual. Arithmetic is another matter. A person 
begins with the simplest operations in first grade and continues to 
receive new information and new skills all the way through high 
school. Only the unusual third grader can perform arithmetic func- 
tions as well as the average sixth grader. This leads to the major 
problem area: The distribution of grade equivalent scores is differ- 
ent in the various subject matter areas. A student like Linda, who 
had consistent percentile ranks of 95 in every area, might get grade 
equivalents of 5.4 in reading, 4.4 in math, 4.7 in science, and 4.9 in 
paragraph skills. The derived scores will suggest that her best area 
is reading and her worst is math when in fact she is equally facile 
in all! So:

Do not use grade equivalents to compare a student's scores across 
subject matter areas. If you need to compare scores across areas — 
such as to check relative strengths in reading versus arithmetic — 
use percentile ranks. Use grade equivalents to measure current 
standing or to measure change within a single subject matter area. 

3. The third major error with grade equivalents comes from the 
fact that a high performing child who maintains his relative standing
in the group will appear to "grow" more than one year between
two grades — say from grade 3 to grade 4. An average performer who 
remains average will appear to grow exactly one year, and a low 
performer who maintains his relative position will appear to grow 
less than one year. Look at these data for three hypothetical stu- 




dents, Tom, Dick, and Harry. While the students are hypothetical, 
the data are not. The figures have been taken from the various 
Examiners' Manuals of the Comprehensive Tests of Basic Skills.⁴

Tom, Dick, and Harry have been tested in October of their 
third-, fourth-, and fifth-grade years. Tom's percentile rank in read- 
ing comprehension has consistently been 15. Dick's percentile rank 
has been 50 in each of these years, and Harry's percentile rank has 
been a consistent 85. While one might hope that schooling would 
cause improvement in each of the boys, at least none has lost 
ground. Compared to the national standardization sample, each of 
the boys has maintained his relative position over the three testing 
periods. According to the manual for the CTBS, here are the grade 
equivalents which would be assigned to the three boys: 

                                 Assigned Grade Equivalents
BOY                              Grade 3    Grade 4    Grade 5

Tom   (constant P.R. of 15)        2.0        2.7        3.3
Dick  (constant P.R. of 50)        3.1        4.1        5.1
Harry (constant P.R. of 85)        5.0        6.6        8.4



The testing, you recall, was done in October of the year. Dick's 
scores, then, are exactly what one would predict. In October of 
fourth and fifth grades his grade equivalents are 4.1 and 5.1 respec- 
tively. He has maintained his relative position and has been "re- 
warded" with a growth coefficient of one year. His teacher and 
parents might not be ecstatic but at least he's making satisfactory 
progress. 

Tom has also maintained his relative position. He is a low- 
performing third grader. In fact, when tested in October of third 
grade, he does only as well as the average second grader. In fourth 
grade his grade equivalent moves to 2.7, or shows an increase of 0.7 
years. From fourth to fifth grade he gains from 2.7 to 3.3, or a gain 
of 0.6. Now the uninitiated will be tempted to say that Tom is not
keeping up. But he is! He has not lost ground, when compared to 
the national standardization sample. 

Harry, on the other hand, while maintaining the constant per- 
centile rank of 85 shows gains in excess of one year. His gains are 
1.6 years and 1.8 years. A newspaper reporter or parent who is not 
familiar with grade equivalents will argue that the school system is 

4. Comprehensive Tests of Basic Skills, California Test Bureau, Division 
of McGraw-Hill, Inc.; Del Monte Research Park, Monterey, California 
(1968). 




concentrating on the academically talented and ignoring the lower 
performers. 

Look at the scores from another angle. Think of Tom as symbolic 
of a particular ethnic group in a large city. How far behind "grade 
level" is Tom in third grade? Grade level is 3.1 at the time of the 
testing, and Tom's grade equivalent is 2.0. At third grade he is 1.1 
years behind. How about at fourth and fifth grade? At fourth grade 
he appears to be 1.4 years behind the grade level of 4.1, and in fifth 
grade he appears to be 1.8 years behind the grade level of 5.1. 
Remember that Tom is symbolic of a particular ethnic group. Can 
you see the inquiring newspaper reporter's headline: CITY ALLOWING
STUDENTS OF ONE ETHNIC GROUP TO FALL FURTHER BEHIND EACH YEAR! and
the subheading: "Spokesman for ethnic group accuses city of systematic
bias!" And it's all simply an idiosyncrasy of the derived scoring
system, since Tom has maintained his relative position each year.⁵
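
The apparent gains and apparent distances from grade level discussed for the three boys can be recomputed with a short Python sketch. The grade equivalents are the ones quoted in the text for October testing; the variable and function names are assumptions made for illustration.

```python
# Grade equivalents quoted in the text (October testing, grades 3, 4, and 5)
grade_equivalents = {
    "Tom (P.R. 15)": [2.0, 2.7, 3.3],
    "Dick (P.R. 50)": [3.1, 4.1, 5.1],
    "Harry (P.R. 85)": [5.0, 6.6, 8.4],
}
grade_levels = [3.1, 4.1, 5.1]  # "grade level" in October of each year

def yearly_gains(ges):
    """Apparent growth, in years, from one testing to the next."""
    return [round(later - earlier, 1) for earlier, later in zip(ges, ges[1:])]

def distance_from_grade_level(ges):
    """Negative values are years behind grade level, positive are years ahead."""
    return [round(ge - level, 1) for ge, level in zip(ges, grade_levels)]

for boy, ges in grade_equivalents.items():
    print(boy, yearly_gains(ges), distance_from_grade_level(ges))
```

Every boy's percentile rank is constant, yet only Dick shows gains of exactly one year and a constant distance from grade level.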

When a sample of students takes a test — almost any test — the 
distribution of scores will be somewhat bell-shaped. A large propor- 
tion will score somewhere near the average and smaller percentages 
will be at the extremes. Score differences at the center of the dis- 
tribution, then, tend to be more stable than are score differences at 
the extremes. They are based on a larger norm group. Interpreta- 
tion of an extreme score at either end of the distribution should 
always be done with caution. The extreme derived scores (per- 
centile ranks, standard scores, or grade equivalents) — even for a 
test which has been normed on thousands of students — may reflect 
the performance of only a handful from the sample. 

The instability of scores at the extremes has led test publishers, 
especially those who deal with elementary school batteries, to make 
an obvious suggestion: Try to use tests such that each individual 
will score somewhere near the middle of the distribution. Tom's 
percentile rank in third grade (on the reading comprehension test) 
was 15. The test was a little too difficult for him, and he scored 
near the extreme lower end.⁶ The teacher probably has a pretty
good idea that Tom's percentile rank will remain at about 15. 
Rather than administer the next test such that he will again score 
at the extreme, it would be better to administer a somewhat easier 



5. Statistically speaking, the reason for the idiosyncrasy is that the
standard deviation of grade equivalent scores increases as you move to 
higher grades. 

6. Arbitrarily, an "extreme" score will be defined as one in the upper or 
lower 15 percent. 




test so that he can score in the "fat" part of the distribution. Then 
his scores would be more stable. Most elementary achievement test 
batteries are constructed so that the task of giving Tom a more 
appropriate test is not too difficult. This concept is called "in-level" 
testing. 



Marking: Reporting the Results 

As was detailed in chapter 4, the manner by which you report 
scores on tests and measurements depends on some other decisions 
which must be made earlier. If a test is criterion-referenced, the 
reporting can be accomplished using a checklist. The checklist in- 
dicates, for each individual, which objectives have been completed 
at or above the minimum acceptance level. If the test is not 
criterion-referenced, then the student needs some sort of reference 
system before any real sense can be made of an individual result. 

The method of reporting scores is just one of five interdependent 
decisions which the teacher must make in planning a test. These 
interdependent categories were the topic of chapter 4. Suffice it to 
say at this point that the manner of reporting back to the students 
is indeed a function of things like specificity of test objectives, 
comprehensiveness of the testing, test administration decisions, 
and the level of expectations you hold for the class. 

If instruction is to be individualized, the reporting system should
be individualized as well. The topic will be discussed in relation to
project evaluation in chapter 12. Individualized instruction suggests
a checklist reporting system. The school specifies the task to
be accomplished and the student works on these tasks at his own 
pace. Periodically, the school reports to the student and his family 
which tasks have been satisfactorily completed. 

If grades (A, B, C, D, F) are to be used as a course evaluation 
technique, the author strongly prefers a type of "contract" system. 
The system is really quite simple and goes something like this: 

1. You have to decide in advance the minimal amount of per- 
formance you will accept as satisfactory in the course (or grade) — 
whatever you are teaching. This amount of performance is assigned 
a D (assuming a D is considered passing in your school). 

2. You then add increments to this minimal amount for the 
grades of C, B, and A. Obviously, the highest grade should require 
the highest performance level by the student. 




3. Finally, you make this system explicit to the students before 
they begin the course. 
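
The three steps above can be sketched as a small lookup in Python. Everything in it is hypothetical: the task names, the single requirement per grade, and the function name `contract_grade` are placeholders for whatever a teacher actually decides and announces in advance.

```python
# Hypothetical requirements; each grade adds to what the grade below demands.
REQUIREMENTS = {
    "D": {"pass unit tests"},       # the minimal acceptable performance
    "C": {"term paper"},
    "B": {"group activity"},
    "A": {"independent project"},
}

def contract_grade(completed):
    """Highest grade whose requirements, and all those below it, are met."""
    grade = "F"
    for letter in ("D", "C", "B", "A"):
        if not REQUIREMENTS[letter].issubset(completed):
            break  # an unmet requirement stops the climb
        grade = letter
    return grade

print(contract_grade({"pass unit tests", "term paper"}))
```

Because the requirements are written down before the course begins, the same table can be handed to students as the explicit statement called for in step 3.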

This may sound easy, but there are pitfalls. Here are a few of 
them you should try to avoid: 

1. Don't set the minimal level too high. Don't get too idealistic 
at this point. Set the level where you believe you would actually 
pass a student. The minimal level must be realistic. If you're new 
to a school district, get advice on typical minimal performance 
standards for that district. 

2. If you've never tried the contract system before, you may 
find that half way through the quarter you want to change the 
rules. Change them if you feel you've made a mistake which is 
penalizing the students unfairly. Don't change if you've given the 
students a break and you just want to stiffen the requirements. 

3. Don't make the higher grades simply contingent on more of 
what is required for the lower grades. If the student must complete 
100 long division problems for a C, it would be a little silly to ask 
for the completion of 200 for a B. Perseverance should not be the
sole criterion upon which high grades are awarded. 

4. Make the highest grade or grades contingent on individual 
initiative. Require an independent project; a presentation to the 
class; effective leadership of a group; or some other activity which 
the student creates. Again, avoid making these tasks too time 
consuming. 

5. Don't be afraid to include test performance in the marking 
system, but try to avoid making it the one single measure. You 
could, for example, establish a passing point on all tests and re- 
quire that all students simply pass the test. Establish a cutoff at 
a realistic level, of course. Add to the test scores things like
projects, papers, group activities, independent studies, attendance 
of special events, or leadership roles. If possible, provide more than 
one route to an A or a B. 

6. What happens if a very high proportion of your students 
completes the tasks and deserves the higher grades — the A's and 
B's? Do you hang your head in shame and apologize to your faculty 
colleagues for being an easy grader, for teaching a "Mick"?⁷
Heavens no! Just as with the clear specification of behavioral ob- 
jectives, the clear specification of grade requirements opens the 

7. A "Mick" is a term frequently used by students to describe a very easy 
course — one where a high grade can be "earned" for a minimal amount of 
effort. 




door for meaningful faculty discussions. If you say: "These were 
my requirements for each grade, and here is proof that these stu- 
dents fulfilled the requirements," the only possible challenge is of 
the requirements themselves. What requirements should be altered 
or added? 

The challengers to your requirements would have to be specific 
in their recommendations and the discussions which would result 
might improve your list of requirements. A clearly specified list of 
requirements for each letter grade will take grading out of the 
realm of the mystical. Some teachers prefer mystical grading to 
objective grading. Mystical grading takes a lot less effort. 

Quota systems are to be discouraged. Some measurement writers 
suggest using a normal distribution of probabilities as the basis for 
grade assignment. In such a grading system, 7 percent of the class should
receive A's and 7 percent F's, 24 percent B's and 24 percent D's, and the
remaining 38 percent C's. Such a system is inappropriate for most pre-college
classroom situations because the student's grade is contingent not only upon
what he does but also upon the caliber of the other students in the
class. Dooming 7 percent to failure regardless of performance level 
is just plain silly and probably unforgivable. A quota system en- 
courages the worst kinds of inter-student competition and discour- 
ages individualized instruction. 
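
For readers curious where quota figures of 7, 24, and 38 percent come from, the following Python sketch derives them from the normal curve. The cutoffs of 0.5 and 1.5 standard deviation units on either side of the mean are an assumption chosen because they reproduce those figures; the text does not state which cutoffs the quota writers had in mind.

```python
from statistics import NormalDist

# Assumed cutoffs, in standard deviation units, separating F | D | C | B | A
CUTOFFS = [-1.5, -0.5, 0.5, 1.5]

def quota_percentages(cutoffs=CUTOFFS):
    """Percentage of a normal distribution falling in each grade band."""
    curve = NormalDist()
    # Cumulative proportions at each band edge, from 0 up to 1
    edges = [0.0] + [curve.cdf(z) for z in cutoffs] + [1.0]
    return [round(100 * (hi - lo)) for lo, hi in zip(edges, edges[1:])]

print(dict(zip("FDCBA", quota_percentages())))
```

The sketch only shows the origin of the percentages; it does nothing to answer the objection that such a quota ties one student's grade to the caliber of the others.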

If standardized tests are given, should students and parents be 
given reports of these results? Of course they should, but it follows 
that the student or parent must understand what the score means. 
If percentile ranks, grade equivalents, or standard scores are used 
in the reporting, some time should be set aside to inform the re- 
ceiver about the meaning, interpretation, and limitations of each 
system. If the school is unwilling to let the public see the results of 
a standardized test, then the school probably should quit adminis- 
tering the test! Look at it this way: 

The school administers a standardized test. The school must 
have had a purpose in mind, otherwise why spend the time and 
money on the test? 

It follows then, that someone in the school is using the test 
results to make decisions about the students tested. 

It also follows that some of these decisions might not be pleasing 
to the student or to his parents. 

Isn't it one of the cornerstones of our legal system that the ac- 
cused has the right to face his accuser? 

Sound silly? It isn't silly at all. In fact, the use of hidden test 
results has caused the misplacement of a lot of school children. 




Sometimes the tests were completely inappropriate for the students 
in question, such as using a test written in English for a
Spanish-speaking child, or using an urban-based test for a child recently
arrived from a poor, rural area. The student and parent have the 
right to know which tests were used, what the scores were, and 
what the scores mean. They should have the right to challenge 
the appropriateness of the measures for a specific individual. 

Should test scores be reported? Yes — provided the receiver of the 
information is willing to spend the necessary time for the scores to 
be interpreted. 



"Acquired Behavioral Dispositions" 

The Measurement of Attitudes and Interests 



"I don't know how she does it, but she keeps 'em motivated — and 
when those kids are motivated, they can really learn!" 

"If only I could get Nancy interested in something she wouldn't 
be so bored in class." 

"Eddie could do the work, but his attitude is so bad. Anything 
that suggests school to him triggers a negative response." 

"These kids just don't value learning like we did. No wonder 
there's so much vandalism in the schools!" 

"A random sampling of student opinions indicates that the pro- 
gram is well liked here." 

Interests. Attitudes. Values. Opinions. Motivation. If a student 
develops certain interests, will some desired behavior follow? Can 
expected behaviors be conjectured on the basis of studies of atti- 
tude, value, opinion, and interests? Can these concepts be measured 
in the classroom? 

The term "acquired behavioral disposition" used in the chapter
title has been coined by Campbell.¹ He notes that the terms

1. Donald T. Campbell, "Social Attitudes and Other Acquired Behavioral
Dispositions." In S. Koch, ed., Psychology: A Study of a Science, Vol. 6 (New
York: McGraw-Hill, 1963), pp. 94-172.

attitude, opinion, valuing, and interests all really mean about the same
thing. Each indicates a tendency to act in a certain way when con- 
fronted with some situation. A commuter, driving along the express- 
way, takes the last cigarette out of the pack. Will his acquired 
behavioral disposition be to toss the empty pack wrapper out the 
window? What is his attitude toward litter? Is he interested in
clean streets? Where does obeying the law fit into his system of 
values? What is his opinion of litter-bugs? Under certain conditions 
the answers to these questions will allow you to predict with some 
accuracy whether or not the man will throw the wrapper out the 
window. 

Since "acquired behavioral disposition" is cumbersome to use, 
the term will be dropped henceforth and replaced with either 
"attitude" or "interest." An assumption of this chapter is that the 
terms attitude, interest, opinion, and value mean essentially the 
same thing. Each describes the tendency for a person to behave in 
a certain manner. 

A discussion of techniques and approaches which can be used to 
change attitudes and behaviors is inappropriate in a book devoted 
to measurement in education, but the measurement of attitude, and 
especially the measurement of attitude change, is important in any 
attitude and behavior change program. Discussions of attitude 
measurement techniques which could actually be used by a class- 
room teacher are included in this chapter. Some of the more 
elaborate techniques will be mentioned, but not covered in any 
depth. 



The Relationship Between a Student's Attitude 
and His Behavior 

The most comprehensive statement about the relationship between 
attitude and behavior is: Under certain conditions, knowledge of 
a student's professed attitude will allow the accurate prediction of 
subsequent behavior. The task of delineating cases where the relationship
does not hold is easier than the enumeration of cases where
it does. For example, a whole series of hints were given as part of 
chapter 7 accompanying the discussion of typical performance 
standardized measures. These hints are reviewed below. Each describes
cases where a student's expressed attitude implies one
behavior, but when the student is actually confronted with a situa- 




tion where he has a chance to carry out that behavior, the perform- 
ance is not as predicted. The comments from chapter 7 are reviewed 
below. 

1. Answers to most measures of typical performance can be 
faked, or at least distorted. The respondent may give answers which 
are believed to be socially acceptable, or he may be trying to fool 
or impress you. 

2. Interests and attitudes do not necessarily predict aptitude. 
The student may make a statement which causes you to predict he 
will later behave in some specific way, but the student doesn't come 
through with the performance — not because of lack of desire, but 
because of lack of aptitude. 

3. Attitudes and interests change in students. Some of these 
changes are moment-by-moment; others occur over longer periods 
of time and are more enduring. A man expresses a very negative 
attitude toward picking up hitchhikers. The attitude would cause 
you to predict that he never would stop for a hitchhiker — and he 
almost never does. But once in a while, when conditions are just 
right (or wrong, whichever way you wish to interpret this) he will 
stop. His behavior is out of line on these occasions with his earlier 
expressed attitude — or perhaps it isn't. His attitude may have 
changed, albeit temporarily, just long enough to cause him to give 
the infrequent lift. 

4. Sometimes a student expresses an attitude which is com- 
pletely honest, given the amount of information available at that 
moment. Based on this expressed attitude, you are caused to pre- 
dict a certain behavior. However, as the student's information 
becomes more complete, his attitude changes (unknown to you) 
and his behavior seems out of line with the earlier expression of 
attitude. For example, a student expresses a very positive attitude 
toward a proposed club. "What would you think if we started an 
amateur radio club in this school?" "Hey, great! I've always wanted 
to try that!" The response would indicate that if the club got 
started, this student would join. Later, when the club does start, 
the student does not join. Why? Probably because of new informa- 
tion which only then was available. Maybe the club meets at an 
inconvenient time. Maybe the cost is prohibitive. Maybe the code 
hurts his ears. Maybe his parents ask him to give up violin lessons 
if he's going to start this new activity. Maybe a lot of reasons! But 
a major reason for the apparent lack of congruence between ex- 
pressed attitude and actual behavior is the lack of information 
available at the time he was "spouting off."

5. Attitudes are often situation specific. The way you respond to 
a stimulus often depends on the situation in which you perceive 
yourself to be. An attitude may be expressed by a student at a 
time when he perceives himself as being threatened. The attitude 
leads you to predict that this student will behave in a certain way 
when confronted with a situation. However, when the situation 
occurs, the student behaves differently. Why? Because at the mo- 
ment of truth the student no longer feels threatened. The attitude 
is different in an unthreatened situation, and so is the predicted 
behavior. 

All of the above reasons for the apparent nonagreement of 
expressed attitude and actual behavior were discussed more com- 
pletely in chapter 7. Some doubts should be in your mind at this 
time. Have you begun asking the question: Is it ever possible to 
predict someone's later behavior from the attitude expressed? 

Under a wide variety of conditions, later behavior can be pre-
dicted from an earlier expression of attitude. Studies based on the
two primary measures of vocational interest (the Strong and the 
Kuder) have shown that people tend to go into vocational areas 
toward which an earlier interest was expressed. Your common sense 
will tell you that you frequently can predict the later behavior of 
people around you by knowing their attitudes. The way your ac- 
quaintances behave in given situations is conditioned by attitudes 
they have probably expressed in your presence. If the attitude is 
negative, the acquaintance will either avoid the attitude object or 
be aggressive toward it. If the attitude is positive, the acquaintance 
will seek out the attitude object. A neutral attitude probably leads 
to passive behavior or indifference. 

You can frequently predict later behavior based on attitudes 
expressed at an earlier time. Just keep in mind that you cannot 
always do so. 



Attitude Change Programs Linked to 
Desired Behavior Change 

Think again of an attitude as an acquired behavioral disposition. 
How does a person acquire a disposition to behave in a certain 
manner? The answer is complex, but must go something like this: 
Throughout a lifetime of facing similar or related situations, the 
person learns which responses are rewarding, which will be passive, 
and which will cause pain or unhappiness. The more experience the
person has had with related stimulus situations, the more stable
will be his disposition to respond in a given manner. After carefully 
taking the notes of caution into consideration, it is then reasonable 
to assume that knowledge of behavioral dispositions which are 
based on experience will allow you to predict later behavior. How-
ever, how about the case of trying to bring about desired behavior
change through a program designed to change attitudes? The link 
between changed attitude and changed behavior is a more tenuous 
one than the link between attitudes acquired over a period of time 
and the associated behaviors. 

The topic of attitude and behavior change is very relevant to 
education, and particularly to special educational programs. After 
all, why would a "special" program be set up in the first place? — 
to change some behavior in the students, probably. Very often, the 
behavior changes are linked to desired attitude changes. Knowledge 
of the special problems associated with the measurement of atti- 
tude change is therefore especially important. 

Programs to change attitudes approach the problem from three 
general directions. The most obvious approach is one in which new 
information is provided. Such an approach is appropriate in cases 
where the undesirable attitude is believed to be caused by lack of 
information or even misinformation. School programs designed to 
decrease the number of users of drugs or alcohol are frequently 
based on this kind of presumption. 

A second approach is a kind of desensitization process. Suppose, 
for example, that you have a student who, for some reason deep in 
his background, withdraws sharply whenever you touch him. Now, 
you probably don't go around routinely touching students, but 
occasionally it is a natural thing to do. To see the student withdraw 
sharply every time you touch him is rather disconcerting, especially 
if you figure he responds that way to every adult. 

You would like to change that behavior. A desensitization proc- 
ess simply involves linking the touching action to something known 
to be rewarding to the student. Perhaps praise will do the trick. 
Every time you touch the boy, praise him. Maybe you'll have to 
use a piece of candy or permission to do some special favor, but in 
some way you need to link a reward with the touching action. 
Desensitization programs can be done with groups as well. The key 
aspect is linking a clearly rewarding action to a stimulus previously 
regarded by the individual or the group as threatening or un-
rewarding.

The third approach to attitude change is to cause the person or 
group to perform some act or to behave in some way which has 
previously been avoided. For example, suppose a girl in your class 
refused to get very involved in any of the game activities which 
were played outside during recess period. Her avoidance, you find 
out, is based on a fear of injury as well as a feeling that she is 
incapable of playing. Assuming you are sure she will not be injured 
and is capable of performing satisfactorily, the behavioral approach 
to changing attitude requires that you figure out some way to 
induce her to participate. Perhaps you can start slowly and link 
some desired reward with performance. This is not a text on behav- 
ior modification, but in some manner you get the girl to participate. 
By participating, she will see that her earlier fears were unwar- 
ranted. When this happens, the change in attitude will probably 
follow. 

Before any program of attitude change can be successful in lead- 
ing to behavior change, three important requirements must be 
fulfilled: 

1. The relationship between the attitude being worked on and 
the desired behavior must be clear to the student. Just telling the 
student about the composition of heroin and its general effects on 
the body may not be enough. The student must link those effects to 
his body. The student must know that the desired behavior is 
avoidance. 

2. The student must learn the skills necessary for performing 
the desired behaviors, and must be capable of doing so. A program 
to change a student's attitude toward the city library, and subse-
quently to cause him to use it, will fail if certain skills and capabili-
ties are absent, or if the library staff is unprepared to help the stu-
dent. For example, the student must know where the library is
located; he must have the means of traveling from home to the 
library; he must know how to find the books once in the library; 
and he must not be frustrated by the library staff in the attempt. 

3. Once the attitude change program is complete, the person 
must have a chance to manifest the desired behaviors. Obviously, 
you can never really see the results of the attitude change program 
until the actual situation arises and the student at least has a 
chance to respond in the desired direction. 

There are two results of an attitude change program which are
easy to understand: The desirable result occurs when the attitude
changes and the desired change in behavior also appears. Less satis-
factory, but at least understandable, is the case where neither atti-
tude nor behavior change occurs. Sometimes, however, a measured
change in attitude is not accompanied by the desired change in
behavior. And sometimes, the desired change in behavior is not
accompanied by a measurable change in attitude. Why?

1. All of the cautions stated in chapter 7 and restated in this 
chapter are appropriate to programs of attitude change. 

2. Prior commitment plays a role. If the students had, prior to 
the attitude change program, committed themselves in a direction 
opposite of the desired one, then the probability of behavior change 
is sharply reduced. If no such prior commitment has been made, 
then the likelihood of bringing about the changed behavior is much 
increased. Prior commitment seems to have a greater effect on be- 
havior than attitude. The expressed attitude may change; but the 
behavior does not follow. 

3. The program may be successful in bringing about a change in 
attitude — a change in the student's disposition to respond to a 
stimulus situation in the desired manner. However, if the environ- 
ment of the student does not support the changed behavior, the old 
behavior (and attitude as well) will probably reappear. You may 
set about to change the attitudes of the students in your class 
about reading the newspaper at home. If, however, your attempts 
at doing so are met with rebuke or ridicule by the parents, the 
old attitudes toward reading newspapers and the old behaviors of 
avoiding newspapers will both reappear. 

4. An attitude change program based on strong threats has more 
likelihood of changing expressed attitudes than changing actual 
behavior. If you scare the hell out of the students concerning the 
effects of contracting venereal disease, their expressed attitudes 
will probably indicate that they will avoid such contacts at all costs. 
Their behavior, in fact, may not conform. Actually, this goes back 
to the earlier caution about situation-specific measures. The atti- 
tude was expressed while in a situation perceived as threatening. 
The behaviors are expressed in situations not usually seen in this 
light. 

5. Finally, there is the problem of changing some attitudes but 
of missing the key one which is necessary to trigger the desired 
behavioral response. An example from the literature will illustrate. 
Zwicker² reports an attempt to alter the behavior of teenagers in
seeking routine tests for diabetes. A persuasive communication was
given to a group of young men, focusing on three aspects: personal
vulnerability to the disease; how the disease would affect a person
if he contracted it; and the importance and benefits that could
accrue from early detection. The program did result in changed
attitudes and perceptions of the last two goals, but no change took
place in the students' perceptions of their own vulnerability. If a
person does not feel particularly vulnerable to the problem, no
preventive action will result. If the attitude seems to change, but
the behavior does not follow, the cause may be that the changed
attitude was not the one which would eventually trigger the desired
behavior.

2. B. L. Zwicker, "Behavioral Effects of Attitudinal Change,"
Psychological Reports, 1968, pp. 839-42.

Enough about the relationship between expressed attitude and 
actual behavior, and especially on the question of attitude and 
behavior change programs. How could a teacher set about measur- 
ing "acquired behavioral dispositions" in the classroom? 



Likert-Type Items 

A variety of techniques for creating paper-and-pencil measures of 
attitude have been suggested, but the one most frequently em- 
ployed was originally suggested by Likert. This approach requires a 
series of stimulus statements about some attitude object and a 
scaled set of responses. The following is a Likert-type item:

A person with limited reading ability can still get along just fine 
in this world. 

____ Strongly Agree
____ Agree
____ ?
____ Disagree
____ Strongly Disagree

The stimulus questions, generally speaking, are such that they
indicate a strong note of agreement or disagreement with the gen-
eral attitude under investigation.

In constructing an attitude measure of some concept, you first 
assemble a variety of statements. Some are worded in a positive 
sense and some negatively. Each is linked to a five-point scale. One 
end of the scale is designed to indicate clearly one pole of the atti-
tude in question. If the above item was from a scale designed to
measure attitude toward the "importance of reading to satisfaction
in life," the desired response would be "Strongly Disagree." A 
person who believed that reading was very essential to later success 
in life would tend to strongly disagree with the stimulus statement. 
The actual measure of attitude is found by assigning numbers to 
each of the responses. For the example, 5 points would be assigned 
to "Strongly Disagree" since this is the desired response; 4 points 
would go with "Disagree"; and down to 1 point for the response 
"Strongly Agree," which is the least desired. 

To construct an attitude scale of this type, start by generating a 
number of general statements about the topic at hand. Then, write
a few items to fit each of the general statements. For example,
suppose the topic was "Attitude toward Reading," where reading is 
defined in the more limited sense of the reading program in the 
school in which you happen to be teaching. Here are some general 
statements as a starting point. Perhaps there are some you would 
add: 

a. Attitude toward the importance of reading success as a de- 
terminant of success in later life. 

b. Attitude toward the importance of reading success as a de- 
terminant of success in other school subjects. 

c. Comparison of attitude toward reading to attitude toward 
other subjects in the school's instructional program. 

d. Attitude toward reading for pleasure, especially in comparison 
to other competing leisure time activities. 

e. Attitude toward having teacher select reading materials for 
class use. 



Questions 

1. Add at least two general statements to the list. 

2. Prepare four general statements on the topic of "Attitude toward 
working on school assignments individually" rather than as part of 
the class group. 



The next task is to link a number of specific stimulus statements 
to each of these general statements. Remember that each state- 
ment should take a firm stand on the topic, in one direction or the 
other. The example given earlier ("A person with limited reading
ability can still get along just fine in this world") takes a firm
stand at the end of the scale which indicates that reading is not 
important for success in later life. A statement in the other direc- 
tion might be one like this: 

People who earn a lot of money in this world probably have learned 
to be good readers. 

In this item, a response of "Strongly Agree" would be the preferred 
one. It would receive the score of 5, and the "Strongly Disagree" 
response a score of 1. 
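
The scoring rule just described can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not a
prescribed procedure: the names RESPONSES and score_item, and the
choice to record each item's preferred end as a string, are invented
for the example.

```python
# Five response positions, in the order they appear on the answer sheet.
RESPONSES = ["Strongly Agree", "Agree", "?", "Disagree", "Strongly Disagree"]


def score_item(response, preferred_end):
    """Return 5 for the preferred extreme, down to 1 for the opposite extreme.

    preferred_end is either "Strongly Agree" or "Strongly Disagree"
    (an assumed way of recording which end the item writer prefers).
    """
    position = RESPONSES.index(response)  # 0 = Strongly Agree ... 4 = Strongly Disagree
    if preferred_end == "Strongly Agree":
        return 5 - position
    if preferred_end == "Strongly Disagree":
        return position + 1
    raise ValueError("preferred_end must be one of the two extreme responses")


# "A person with limited reading ability can still get along just fine
# in this world." The preferred response is Strongly Disagree.
print(score_item("Strongly Disagree", "Strongly Disagree"))  # 5
print(score_item("Agree", "Strongly Disagree"))              # 2

# "People who earn a lot of money in this world probably have learned
# to be good readers." The preferred response is Strongly Agree.
print(score_item("Strongly Agree", "Strongly Agree"))        # 5
print(score_item("Disagree", "Strongly Agree"))              # 2
```

Recording the preferred end with each item, rather than in the
scoring routine, keeps your definition of a "good attitude" visible
next to the statement it belongs to.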

The concept of a "preferred response" sometimes causes people 
annoyance. You might say, "That's just not true that people who 
earn a lot of money have learned to become pretty good readers. 
Take the case of ol' . . ." — and a few examples are given. The 
preferred response concept illustrates the importance of finding out 
exactly how any measure defines the concept being measured. In 
constructing your attitude measurement scale and designating the 
preferred response for each item, you are defining in behavioral terms
precisely what you mean by a "good attitude toward reading." You 
are not saying, "This is the definition," or, "This is the correct 
definition." You are saying, "This is what I believe illustrates a 
good attitude toward reading." 

When you use measures constructed by other people, even mea- 
sures constructed by famous people or million-dollar companies, 
this is precisely what they do also. A publisher of a college aptitude 
test cannot say, "This is the correct set of exercises for predicting 
college success." The combination of items defines for you how 
that publisher defines skills necessary to succeed in college. A pub- 
lished measure of "Introversion" does not define introversion once 
and for all. Instead, the test only tells you how that particular test 
author defines this construct. 

To summarize, here are some hints before moving on:

Hint 1: Write a number of actual items about each of your gen- 
eral statements. 

Hint 2: Phrase about half of the items for each general statement 
in a strongly positive sense, and about half in a negative sense. 

Hint 3: Clearly indicate, as you construct the items, which end 
of the response scale is, in your opinion, the preferred end. 

The Likert-type measure will be difficult to use if the preferred 
response is not one of the extreme ones. For example, consider this 
item. 



Reading is the most important thing that happens in school.

____ Strongly Agree
____ Agree
____ ?
____ Disagree
____ Strongly Disagree

The problem with the item is that the preferred response is 
probably not legitimately at either extreme. To be sure, you 
wouldn't like to see your students strongly disagree with the state- 
ment, or even just disagree. On the other hand, the school may 
have other goals which are even more important than reading. Any- 
how, many people think so. Thus, the preferred response is not at 
the "Strongly Agree" end either. If the item were included, the 
score of 5 would have to be assigned to either the "Agree" or "?" 
response, and that turns out to be both intellectually questionable 
and operationally tedious. 

Hint 4: Construct the items such that the preferred response is 
clearly in one of the two extreme locations on the response scale. 

People differ in their willingness to take a stand on issues. Two 
students who feel equally strongly in their attitudes toward read-
ing might respond differently to the question because one of them
hesitates to take any stand, and especially a strong stand, on any 
topic. A student with this kind of hesitation will tend to go for the
middle response ("?") even when the attitude actually felt is
stronger than a shoulder shrug. If you feel that forcing each
student to make some sort of commitment on each item is impor- 
tant, you can simply eliminate the middle response. Those people 
who hesitate to take a stand will now be forced to go one way or 
the other. Of course, the hesitancy will probably keep them from 
choosing the extreme response at any time, but at least you'll have 
an indication of the direction of their attitude. 

Hint 5: In cases where it is important that each student at least 
indicate the direction of his attitude toward each item, eliminate 
the middle response position. 

Hint 6: Try to write the stimulus (the statement) as specifically 
as possible so that all of the students will interpret it in approxi- 
mately the same manner. Take for an example the item given 
earlier: 

Reading is the most important thing that happens in school.

The purpose of the item is to discover the student's attitude toward 
the importance of reading in comparison to the "other things" that 
happen in school. What other things? The students will have differ- 
ent perceptions. One will consider only the other formal subjects 
(science, mathematics, spelling, social studies), while another will 
include all of these but also school elections, athletic activities, and 
social interactions. The scaling problem mentioned earlier and the
vagueness problem mentioned here could both be corrected by re-
stating the item:

Out of all the subjects taught in school (for example, science, 
arithmetic, social studies, and spelling), reading is the most im- 
portant. 

Now everyone is answering the same question and "Strongly 
Agree" is the preferred response (only, of course, if that's how you 
define a good attitude toward reading). 

The same series of comments applies to the response list as 
well. Where possible, change the continuum of responses from 
"Strongly Agree" through "Strongly Disagree" to reflect more 
specific amounts. Take, for example, this item: 

Laws are made to punish people. 

____ Strongly Agree
____ Agree
____ Disagree
____ Strongly Disagree

Assuming the writer of this item believed that laws are made to
protect people and not to punish them, the preferred response is 
"Strongly Disagree." The difference between any two adjacent 
responses, though, is vague. How does the student objectively dif- 
ferentiate between "Disagree" and "Strongly Disagree"? To avoid 
this problem, the item could be slightly reworded such that more
objective response statements are possible. For example: 

What proportion of laws are made to punish people? 

____ All of them
____ About half of them
____ A few of them
____ None of them



A very positive attitude toward the law³ would lead the student
to choose the bottom response. Whether or not you agree that the 
fourth response is the preferred one, the fact is that now all the 
students should have a fairly equivalent picture of the meaning of 
the four response categories. The next hint, then, is 

Hint 7: Where the response categories could easily be linked to 
specific amounts, restructure the stimulus statements to conform 
to this pattern. 

Using Hint 7 will probably require some tryout work for your 
attitude measure. Whatever amounts you choose to give as options 
have to define a range of responses, all of which are reasonable. The 
distance between any two adjacent responses must have psycho- 
logical meaning. Take these four responses to the previous ques- 
tion: 

What proportion of laws are made to punish people? 

____ All of them
____ Most of them
____ The majority of them
____ At least half of them

What is the difference between "Most of them" and "The ma- 
jority of them"? or between "The majority of them" and "At least 
half of them"? The range of responses is limited: no one can possi-
bly indicate an attitude that very few laws are made to punish
people.

Hint 7a: When you use response categories linked to specific 
amounts, allow the most gradations to occur at the end of the scale 
where you think change is going to occur. 

The response item about the proportion of laws designed to 
punish people could have been constructed in these two manners: 

Form 1                      Form 2

____ All of them            ____ All of them
____ About half of them     ____ Most of them
____ A few of them          ____ About half of them
____ None of them           ____ None of them

Form 1 has three distinct points from the "About half" down to
the "None" response, while Form 2 has three points from the
"Half" response up to the "All of them" category. Unless the group
being measured is particularly cynical, Form 1 seems more appro-
priate. The tendency would be, it seems, for students to feel that
half or less of the laws are made to punish people, rather than half
or more. The greatest number of gradations, then, should occur in
that direction.

3. And some would say a very naive one.

Hint 8: An interesting procedure is to ask the same question 
twice — once making the question general, the second time more 
specific — to see if students are consistent. 

For example, these two questions move from general to specific: 

For most students, doing well in reading will lead to doing well in 
all the school subjects. 

In this school, a good reader is likely to be good at every subject. 

The two items would be separated in the final measure, of course. 
This doublet goes from "most students" which implies students
everywhere, to "this school." The general to specific could also go 
from "most students" to "I" — the second statement might have 
been "If I am a good reader in this school . . ." Sometimes the 
doublet can begin with a general statement, then give a specific 
instance to see if the students really are consistent in the general 
attitude. For example:

If a person is clearly guilty of a criminal act, a trial by jury is not 
necessary. 

and 

A policeman catches a woman while she is stealing a car. Still, 
she will not admit to the act. In this case, a trial by jury is not 
necessary. 

The preferred response for both items is "Strongly Disagree." A 
trial by jury is guaranteed by the Constitution and no exceptions 
should be made. The item will show the teacher whether or not the 
student really knows what that first statement means and if he is 
willing to apply the principle in the second item. 
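
One way to look at a doublet after the measure has been given is to
compare each student's scores on the two items. The sketch below is
only an illustration: the function name, the item labels, the student
names and scores, and the one-point tolerance are all assumptions
made for the example.

```python
def inconsistent_students(responses, doublet, tolerance=1):
    """List students whose scores on the two items of a doublet differ
    by more than `tolerance` points.

    responses maps a student's name to a dict of {item_id: score}.
    doublet is a pair of item ids, general statement first.
    """
    general, specific = doublet
    flagged = []
    for student, scores in responses.items():
        if abs(scores[general] - scores[specific]) > tolerance:
            flagged.append(student)
    return flagged


# Hypothetical scores (5 = preferred response) on the jury-trial doublet.
responses = {
    "Ann":  {"jury_general": 5, "jury_specific": 5},
    "Ben":  {"jury_general": 5, "jury_specific": 2},
    "Cara": {"jury_general": 4, "jury_specific": 3},
}
print(inconsistent_students(responses, ("jury_general", "jury_specific")))
# ['Ben']
```

A flagged student is not necessarily careless. He may hold the
general principle but be unwilling to apply it to the specific case,
which is exactly what the doublet is meant to reveal.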

Hint 9: If at all possible, try the items out orally on two or three
students like those with whom the instrument will be used.

It is common practice for measurement authors to suggest that 
the teacher go through trials and rewrite sessions with all measures 
constructed for classroom use. The press of other responsibilities
usually makes a tryout a practical impossibility. In the case of an
instrument written to measure attitudes, however, the tryout be- 
comes especially important. The major task of the tryout is to find 
out if the students are answering the question you meant to ask. 
Sometimes the phrasing you use in an item is completely clear to 
you, but the students misunderstand. Such items will be easy to 
pick out if you administer them orally to two or three students. If
the items are misunderstood, the students will tend to give responses
which appear ridiculous to you. A couple of probes at that point will bring
to light points of misunderstanding, and the wording can be cor- 
rected. 

If a measure of an "acquired behavioral disposition" was to be 
published and standardized, the actual writing of the items would 
be one of the first steps. The publisher would try the items out at 
least once with the kind of people for whom the measure had been 
constructed. These tryouts would be primarily internal consistency 
checks. For example, suppose the measure was a liberal/conserva- 
tive scale. A very conservative person should tend to give the 
conservative responses on all of the items, and the liberal person 
should tend to give the liberal responses. If, on one item, a lot of 
"cross-overs" occur, where some liberals and some conservatives 
answer at each end of the scale, the item is a questionable one. 
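
A rough version of this "cross-over" check can be sketched as
follows. The approach shown (split the respondents into a high half
and a low half by total score, then compare the two halves on each
item) is one simple way to do it and is an assumption here, not a
description of any publisher's actual procedure. The function name,
the cutoff of one point, and the data are hypothetical.

```python
from statistics import mean


def questionable_items(scores, min_gap=1.0):
    """Flag items on which high and low total scorers answer alike.

    scores is a list of per-respondent lists, one score per item.
    At least two respondents are needed. An item is flagged when the
    average score of the high half exceeds the average score of the
    low half by less than `min_gap`.
    """
    ranked = sorted(scores, key=sum)          # lowest total first
    half = len(ranked) // 2
    low, high = ranked[:half], ranked[-half:]  # middle person dropped if odd
    flagged = []
    for item in range(len(scores[0])):
        gap = mean(r[item] for r in high) - mean(r[item] for r in low)
        if gap < min_gap:
            flagged.append(item)
    return flagged


# Hypothetical scores for four respondents on three items.
scores = [
    [5, 5, 1],
    [4, 5, 5],
    [1, 2, 5],
    [2, 1, 1],
]
print(questionable_items(scores))  # [2]
```

In the example, the third item (index 2) is flagged because high and
low scorers are found at both ends of its scale. Note that each item
is itself counted in the total used for the split, which is a
simplification.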

Generally speaking, though, the classroom teacher, or anyone 
else constructing an attitude measure for use one time with one 
group of students, will not go through the tryout period. If the 
items are constructed around some carefully considered general 
schemes, following the hints given in this section, the measure 
should provide useful information. Measures of interest and of atti- 
tude are important for most classroom and project objectives. The 
lack of a published and standardized measure for such objectives 
is not a satisfactory reason for ignoring this assessment. An instru- 
ment which has been carefully written, even without extensive 
tryout efforts, is better than no measure at all. 



Questions 

3. Choose a topic about which you would like to determine student 
attitudes. You can use students of any age. First, write at least five 
general statements about the topic, reflecting different aspects. Then 
construct a twenty-item attitude measure with Likert-type items.

4. Administer the twenty-item attitude scale from question 3 to at least 
15 students. Compile total scores for the tests, using a score of 4 if 
the person gives the most preferred response, 3 for the second most 
preferred, down to 1 for the least preferred. Show a frequency 
distribution. 

5. Engage two high-scoring and two low-scoring students in brief con- 
versations about the topic at hand. Based on their comments, do you 
think the high- and low-scoring students actually have real differ- 
ences about the topic? 

6. Try to locate some sort of attitude scale which has been used in a 
research study. Critique the instrument, based on the hints of this 
chapter. 
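
The tallying called for in question 4 can be done by hand, but a short
sketch may help if you keep the scores in a file. Everything named
here (the function names, the interval width of ten points, and the
sample scores) is hypothetical and chosen only for illustration.

```python
from collections import Counter


def total_scores(answer_sheets):
    """Sum each student's item scores (4 = most preferred ... 1 = least)."""
    return [sum(sheet) for sheet in answer_sheets]


def frequency_distribution(totals, width=10):
    """Count how many totals fall in each interval of the given width."""
    counts = Counter((t // width) * width for t in totals)
    return {f"{low}-{low + width - 1}": counts[low] for low in sorted(counts)}


# One hypothetical twenty-item answer sheet.
sheet = [4] * 12 + [3] * 5 + [1] * 3
print(total_scores([sheet]))  # [66]

# Hypothetical totals for seven students (possible range 20 to 80).
totals = [72, 65, 58, 61, 77, 49, 66]
print(frequency_distribution(totals))
# {'40-49': 1, '50-59': 1, '60-69': 3, '70-79': 2}
```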



The Method of Equal-Appearing Intervals

Thurstone's method of equal-appearing intervals will be described 
only briefly here because it is really of most interest to measure- 
ment specialists. An attitude measurement device developed follow- 
ing this technique involves an extensive development period. The 
process goes something like this: 

1. First, a whole series of statements about the particular con- 
struct are developed. For example, suppose the construct happened 
to be "Attitude toward teaching deaf children." The scale con- 
structor would try to develop a large number of items (as many as 
200) which covered the range of possible responses. The pool of 
items should run all the way from, "The most rewarding occupation 
in the world is the teaching of deaf children," to "Deaf children are 
incapable of learning and should be placed in a caretaker environ- 
ment." 

2. Next, a panel of judges is assembled. The judges are supposed 
to know something about the construct at hand. For the example, 
the judges would be experts in the area of teaching deaf children, 
and would be aware of the entire spectrum of attitudes manifested 
by people on the topic of teaching deaf children.

3. The judges, working individually, place the 200 or so state- 
ments into eleven piles. The first pile contains those statements 
indicating the least favorable attitude. The last (eleventh) pile is 
saved for those statements which, in the judges' opinion, indicate 
the most favorable attitude. Each statement is placed in one of the 
eleven piles by each of the judges. 




4. In actually constructing the scale, the developer looks for 
consistency and range. The goal is to select some ten to twenty-five 
items from the pool which will eventually make up the scale. The 
items are selected to range evenly from category 1 to category 11. 
The scale constructor looks for items which are agreed upon by the 
judges. An item which is consistently rated by all of the judges at 
a 3, or at a 2-3, or a 4, is preferred over one upon which the judges 
couldn't agree. Again, it's a matter of internal consistency — a wide 
range of responses to an item by the judges indicates some amount 
of ambiguity in the statement. 

5. Suppose twenty items are selected for the final scale. The 
items would be given scale values roughly corresponding to the 
average assignment by the judges. With twenty items, the scale 
constructor would try to get about two items for each of the eleven 
original categories. The twenty items would then be randomly 
arranged and administered to people in the intended audience. 
Each of these individuals would be asked to mark those statements 
with which he or she agrees. For example, perhaps five items on the 
"Attitude toward teaching deaf children" scale are these, where 
the average pile-placement by the judges is given in parentheses: 

___ The most rewarding occupation in the world is the teaching
    of deaf children. (10.8)

___ Deaf children are incapable of learning and should be placed
    in a caretaker environment. (1.2)

_X_ I would find working with handicapped children very rewarding.
    (9.3)

___ Expenditures for the education of deaf children are too high.
    (3.5)

_X_ Teachers of the deaf deserve higher salaries than teachers of
    children with full hearing. (6.0)

The actual scale would contain about fifteen more items. For this 
individual, who has already marked the third and fifth items, the 
score would be obtained by averaging the scale values of all those 
marked. As you can see, this individual did not subscribe to either 
of the extreme positions, and will probably end up with a scale 
value somewhat above the average score of approximately 6.0. 
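
The scoring rule just described (average the scale values of the marked statements) can be sketched in a few lines of Python. This is only an illustration: the scale values are the five quoted above, while the short dictionary keys and the function name are hypothetical labels, not part of the original scale.

```python
from statistics import mean

# Scale values (average pile placement by the judges) for the five
# sample statements quoted above. The short keys are hypothetical labels.
SCALE_VALUES = {
    "most_rewarding_occupation": 10.8,
    "incapable_of_learning": 1.2,
    "working_with_handicapped_rewarding": 9.3,
    "expenditures_too_high": 3.5,
    "higher_salaries": 6.0,
}


def thurstone_score(endorsed, scale_values=SCALE_VALUES):
    """Average the scale values of the statements the respondent marked."""
    if not endorsed:
        raise ValueError("respondent marked no statements")
    return mean(scale_values[item] for item in endorsed)


# The respondent in the example marked the third and fifth statements.
score = thurstone_score(["working_with_handicapped_rewarding", "higher_salaries"])
print(round(score, 2))
```

On these five statements alone the respondent's score is the average of 9.3 and 6.0; on the full twenty-item scale the same function would simply receive more endorsed statements.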

In assembling such a scale, the judges don't necessarily have to 
be gray-bearded professors with long titles and many degrees. A
"judge" is defined as someone who ought to know a good deal about
the topic at hand. If the topic is school-related, parents or other 
teachers could be used. The direction for the judges is to sort the 
statements in the "proper" sequence regardless of personal atti- 
tudes on the topic. 
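
The item selection in step 4 can also be sketched in code. The sketch below makes several assumptions that the text does not specify: agreement among judges is measured with the standard deviation of their pile numbers, the cutoff of 1.0 is arbitrary, and the statements and placements are invented for illustration.

```python
from statistics import mean, pstdev


def summarize(placements):
    """Return {statement: (scale_value, spread)} from judges' pile numbers (1-11)."""
    return {s: (mean(p), pstdev(p)) for s, p in placements.items()}


def select_items(placements, max_spread=1.0, per_pile=2):
    """Keep statements the judges agree on, at most `per_pile` per rounded pile."""
    chosen = {}
    summary = summarize(placements)
    # Visit the most consistently rated statements first, so they win
    # a place when a pile is crowded.
    for s, (value, spread) in sorted(summary.items(), key=lambda kv: kv[1][1]):
        pile = round(value)
        if spread <= max_spread and len(chosen.setdefault(pile, [])) < per_pile:
            chosen[pile].append((s, round(value, 1)))
    return dict(sorted(chosen.items()))


# Hypothetical placements by five judges for three statements.
judge_placements = {
    "A": [3, 3, 2, 3, 4],       # judges agree: scale value 3.0
    "B": [1, 6, 11, 3, 9],      # wide disagreement: ambiguous statement
    "C": [10, 11, 11, 10, 11],  # judges agree: scale value 10.6
}
selected = select_items(judge_placements)
print(selected)
```

Statement B is dropped because the judges scatter it across the piles, which is the ambiguity the text warns about. The `per_pile=2` default mirrors the suggestion of about two items for each of the eleven categories.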

One other comment: The technique's usefulness is not limited to
cases where the final array of items will be used in an attitude 
scale. Sometimes the final scale itself is instructive. For example, a 
researcher might be interested in determining how high school 
students view the relative seriousness of certain types of crimes. 
Brief descriptions of a large number of crimes could be assembled 
and a sample of students asked to place them in eleven stacks, with 
the eleventh stack reserved for the most serious crime and the first 
stack for the least serious. The list could include items such as, 
"Disobeying a stop sign," "Smoking marijuana," "Shoplifting," 
"Fraud on an income tax return," and "Armed robbery." Not only 
could the researcher find the relative seriousness of the crimes with 
one group of students, but the final scale from one group — say 
suburban high school students — could be compared to the final 
scale from other groups — for example, inner-city high school stu- 
dents, or rural adults, or recent immigrants to this country. In such 
a study the scale values actually assigned, indicating relative seri- 
ousness of the crime, would be compared for the disparate groups. 



Occupational Choice as a Measure of Interest 

Sometimes you may be interested in finding out if an instructional 
activity has altered the student's interest in some particular area. 
The science activity might introduce the work of marine biologists; 
the social studies activity for a semester might concentrate on 
law-focused social studies; or the English class unit might concen- 
trate on nonfiction writing. The question becomes: Has anyone's 
interest with regard to the topic been changed? If so, how? 

A simple measurement device is a forced choice occupation scale. 
To illustrate, take the example of law-focused education. Each item 
would contain three occupational choices, all with approximately 
equivalent salaries. One of them should be from a law-related field 
— for example, law enforcement, probation areas, law making, the 
penal and parole systems, the judicial areas, and litigation. A 
typical item might be: 




_L_ Army enlisted man

___ State highway patrolman

_M_ Fireman



The student is directed to mark an L by the occupation he pre- 
fers the least, and an M by the occupation he prefers the most. 
Approximately twenty-five items should be constructed. The items 
are scored 3 if the law-related occupation is chosen most preferable, 
2 if it is the middle choice, and 1 if it is the least preferred. For the 
example, since the law-related occupation is the second choice, the 
item would be scored 2. 
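
As a small illustration of this scoring rule, the sketch below scores one forced-choice item. The function name and argument names are hypothetical; the 3/2/1 rule and the sample item come from the text.

```python
def score_item(law_occupation, most, least):
    """Score one forced-choice item from the student's M and L marks.

    3 if the law-related occupation is marked most preferred,
    1 if it is marked least preferred, and 2 if it is left unmarked
    (the middle choice).
    """
    if most == least:
        raise ValueError("M and L must mark different occupations")
    if law_occupation == most:
        return 3
    if law_occupation == least:
        return 1
    return 2


# The sample item: L on "Army enlisted man", M on "Fireman".
example = score_item(
    "State highway patrolman", most="Fireman", least="Army enlisted man"
)
print(example)
```

A student's total interest score would be the sum of `score_item` over all of the items, so a twenty-five item scale runs from 25 to 75.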



The Need for a Base Line with Attitude Measures 

"He's a tall fellow — a little over six and a half feet tall." 

You know what that means partially because you're familiar with 
people of that height, and partially because you know that means 
about 6 and a half feet more than zero. Ten pounds means ten 
pounds more than nothing. Zero degrees Centigrade means about 
273 degrees Centigrade more than as cold as anything can possibly 
get — a point called "absolute zero" in case you've forgotten past 
science training. All of these measures have a zero point — a refer- 
ence point which stands for the smallest possible amount. You 
don't need any other measures for comparison. You only need one 
single measure. 

Measures of acquired behavioral dispositions have no zero point. 
Suppose you constructed a Likert-type attitude measure of 25 
items, each with five responses ranging from Strongly Agree to 
Strongly Disagree. The preferred response was always scored with 
a 5, and the least preferred response with a 1. The highest possible 
score, obtained only when the student endorsed the extreme response
at the preferred end on all 25 items, would be 125 (25 × 5).
Even a person scoring 25 cannot be construed as having no attitude, 
or the lowest possible attitude, toward the concept at hand. You 
undoubtedly can conceive of another person who might have an 
even worse attitude than the person scoring the 25. A single ad- 
ministration of the measure with one group really doesn't tell you 
very much, since the individual scores and average scores for the 
group have no absolute meaning. You'll end up saying, "Well, let's 
see, the mean was 83.7, is that good or bad?" 
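
The arithmetic behind the 25-to-125 range can be sketched as follows. The constant and function names are hypothetical; only the 25 items and the 1-to-5 scoring come from the text.

```python
N_ITEMS = 25
MIN_RESPONSE, MAX_RESPONSE = 1, 5  # least preferred .. most preferred


def likert_total(responses):
    """Sum the item scores for one student on the 25-item measure."""
    if len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} responses")
    if any(not MIN_RESPONSE <= r <= MAX_RESPONSE for r in responses):
        raise ValueError("each response must be scored from 1 to 5")
    return sum(responses)


lowest = likert_total([MIN_RESPONSE] * N_ITEMS)
highest = likert_total([MAX_RESPONSE] * N_ITEMS)
print(lowest, highest)
```

Note that the lowest total is 25, not zero. The code can say where a score falls between the two ends of the range, but not whether that score is good or bad.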


Of course, standardized measures provide a comparison point. 
You can compare a group average or an individual score with scores 
reported in the standardization sample. But if you build an attitude 
scale for a class, or for some group in a project, you haven't even 
got a standardization sample for comparison. The most common 
way of finding a reference point is through one of two techniques. 
Either you seek out a control group, or else you use both a pre-test 
and a post-test with your target group.⁴

With a pre-test and post-test design, obviously, the target group 
is measured before the experimental treatment begins, and then 
again at the end of the program. A control group is made up of 
another group of students, as similar as possible to the target group, 
but who have not had the "benefit" (the use of the word is pre- 
sumptuous) of the experimental or special program. If a control 
group is used, the need for pre-testing is not a mandatory one. 
However, with attitude measures, even with a legitimate control 
group, a pre-test and a post-test design following one of the hints 
to reduce the test sensitization problem is worth the trouble. The 
many problems with control groups, and the statistical handling of 
the data from either a pre-test and post-test design or from the 
use of a control group, are beyond the scope of this book. 



THE SOCIOGRAM: Social Relationships 
Among Students 

The teacher in the room is certainly a key determinant of the learn- 
ing climate, to be sure, but the amount of learning that goes on also 
depends on the way the other people in the room interact with one 
another. As the students get older, the pressure from and the
importance of peers begin to be the dominant influence in the
classroom. Only an unwise teacher will force a student to choose
between loyalty to the teacher and loyalty to a peer.

Actually, if you're like most teachers, it won't take you very
long to get a fairly accurate perception of the social structure in
your classroom, but you must remember that your evaluations are
those of an adult, and an outsider. You might be surprised at the
degree to which some students are isolated.

4. Some thoughts on the problem of test sensitization were covered in
chapter 8 as part of the discussion of reliability. These suggestions should
be carefully considered in any pre-test and post-test design with attitude
measures, since the number of items is usually small in an attitude measure
and the items are generally rather unique. Both aspects facilitate remembering
the items over substantial time spans.

A simple technique for finding out the students' perceptions of 
one another is called a sociogram. The classroom teacher can find 
a sociogram an interesting device to help establish work groups, 
single out students who need special help in adjusting to the group, 
facilitate communication among different elements in the classroom,
and measure the effects of some special program on the
social climate in the classroom. If your students do not seem to
work well as a group, or if the group seems apathetic or even hostile, 
a sociogram might be a good place to begin trying to isolate, and 
solve, the problem. 

To construct a sociogram, wait until a need for grouping in some 
task comes along; then ask the students to choose the two or three 
other students with whom they would like to work. A class does not
have a single interaction pattern. One pattern will surface when 
the group works on some learning task; another when the topic 
moves to planning an activity; another when the grouping is for 
socialization only; and perhaps another still when the activity 
involves sports or an outdoor game. Decide which of these you wish 
to know about, and wait for (or invent) a real activity that allows 
you to gather the baseline information. 

For example, suppose you wish to know about the strictly social 
relationships in the classroom. Look for some forthcoming event, 
e.g. a field trip, which could conceivably involve small groups in a 
social setting. Ask each student to select two or three others for 
small groups on the trip. A key aspect of the sociogram is that it 
must not be phony to the students. Make the reason for gathering 
the data a real one, and when you have the data, use it in the 
manner you had originally promised. If no field trip is on the hori- 
zon and you want to establish a social relationship sociogram, you 
could promise to rearrange the desks in the room and ask which 
two or three students each individual would choose as next-door 
neighbors. Figure 10 shows an example of a form you might give to 
the students. 

You would be wise to figure out some way that the students (a) 
aren't looking over each other's shoulders; (b) aren't making hand 
gestures, bad faces, or nonverbal deals as they answer; and (c) 
have available a list of the names of people in the room so they 
don't have to be peering around at one another like a bunch of 
animals at the auction block. You wouldn't want your efforts
toward improving social relations to backfire and damage someone's
psyche beyond recognition, or to strengthen already existing
cliques. To avoid the problem, try to figure out a way that between-
student communication will be diminished during the completion
of the form.

NAME ____________________

DIRECTIONS: Next week we will be going on a
field trip to Gerry's Giant Gerbil Gymnasium. You
will be grouped with three or four of your classmates
at the Gymnasium. Indicate below which two class-
mates you want to have in your group. (You don't
have to choose anyone if you don't want to.)

Figure 10
Sample Form

The results from a class of fourteen students could be tabulated 
like this: 

STUDENT NAME    TWO CHOICES            CHOSEN BY HOW MANY?

Olive           Jessie (one choice)    1
Pete            Tom, Milhouse          0
Tom             Harry, Jane            4
Jessie          Olive (one choice)     1
Carolyn         (refused to choose)    0
Mary Jo         Lynn, Jane             1
Harry           Morton, Tom            2
Milhouse        Morton, Myron          3
Myron           Morton, Milhouse       2
Laura           (refused to choose)    1
Lynn            Laura, Mary Jo         1
Jane            Tom, Dick              2
Dick            Harry, Tom             1
Morton          Myron, Milhouse        3


To draw the sociogram, start with the "star" of the group — the 
person most frequently selected. You could put the boys' names in 
squares and the girls' in circles. Draw one-way arrows to show the
direction of the choice. A line between two students with arrowheads
on both ends will indicate a mutual choice. The actual drawing
of the diagram involves trial-and-error, so make your first
attempts lightly in pencil. Figure 11 shows one tabulation of the 
results in the example. 

Now, what does all this tell you? 



[Figure 11 diagram not reproduced. Legend: a double-headed arrow marks a mutual choice; a single-headed arrow marks the direction of a choice; girls and boys are drawn with different symbols.]

Figure 11
Social Relations Sociogram for a Class of Fourteen Students

Tom is the "star" of the group, chosen by more students than 
anyone else. 

Jessie and Olive make up an isolated pair — they seem to have 
withdrawn and their classmates are allowing them to do so. 

Pete and Carolyn are isolated. Pete is still reaching out to the 
group, though. Carolyn chooses no one and is chosen by no one. 

Milhouse, Morton, and Myron form a very tight little clique. 

Tom, Dick, Harry, and Jane form a somewhat looser clique.
Class social activities probably revolve around these four. 

Not surprisingly there is a sex cleavage, with only Tom (choosing 
Jane) and Jane (choosing Tom and Dick) crossing over the line. 

The diagram provides information but gives neither reasons nor 
solutions. Why do Carolyn and Laura refuse to choose anyone? 
Should you do anything about it, and if so, what? Perhaps they're
new students. Perhaps they could use some help from you which
would foster interaction with peers. The teacher for this group 
should keep the cliques in mind, remembering that a decision by 
one member of such a group will undoubtedly be a function of what 
the other members feel is most appropriate. A member of a tight 
clique usually will not make decisions without consulting the other
members. 

This class has not become well integrated, especially on the 
female side of the diagram. The girls are all spread out in sort of a 
broken chain. The teacher might well establish settings so that 
students who are not linked by the arrows do work with each other. 

Some other comments about the use and interpretation of socio- 
grams in the classroom: 

To see if a student is really isolated, you could ask for more 
than two choices from each pupil. Carolyn, for example, might have 
been the third choice of five different students, which would indi- 
cate less isolation than the diagram suggests. The diagram gets a 
little tricky with more than two choices, but the information is still 
useful even without a diagram. 

Some suggest that you might also ask each individual to name 
the two or three students with whom he least wants to work. This 
gives a direct (and often devastating) picture of the rejected stu- 
dents in the class, and is a questionable and potentially harmful 
practice. Asking students to verbalize their dislikes tends to rein- 
force them. The responses to such a question can easily become the 
topic of playground conversation, where the rejected student becomes
the target of even more humiliation. The sociogram based on
positive choices will undoubtedly give you all the information you 
need about rejection. Seeking the information on rejection overtly 
seems to have more potential for harm than for good. 

You might be interested in obtaining sociograms at various times 
during the year to see how the social structure changes. This 
suggestion is especially appropriate where you are taking some sort 
of positive action to attempt to alter undesirable relationships seen 
in an earlier structure. 



Questions 

7. The fourteen students were asked to name the two students with 
whom they preferred to work on a science project. The results are 
given below. 



STUDENT         TWO CHOICES              CHOSEN BY HOW MANY?

Jessie          Olive, Mary Jo           ____
Olive           Jessie, Mary Jo          ____
Carolyn         (no choices)             ____
Laura           Mary Jo (one choice)     ____
Lynn            Laura, Dick              ____
Mary Jo         Olive, Pete              ____
Jane            Tom, Dick                ____
Pete            Tom, Mary Jo             ____
Tom             Jane, Pete               ____
Harry           Dick, Pete               ____
Dick            Lynn, Harry              ____
Milhouse        Morton, Myron            ____
Morton          Milhouse, Myron          ____
Myron           Morton, Milhouse         ____



a. Fill in the "Chosen by How Many?" column. 

b. Construct a sociogram from these results. 

8. Compare the sociogram you construct with the earlier one based on 
social relations. 

a. Name the students whose roles change sharply and describe the 
changes. 

b. What clear similarities do you see? 

c. Make some suggestions to the teacher about action which might 
be taken on the basis of the two sociograms. 

(NOTE: Questions like the above do not have any single, clear 
correct answer.) 



Summary 

An acquired behavioral disposition is a tendency to act in a certain 
way when confronted with some situation. Measures of interests, 
attitudes, values, and opinions can all be interpreted as measures 
of acquired behavioral dispositions. The measures are needed be- 
cause the actual observation of the student when confronted with 
the situation itself is often difficult, if not impossible. 

When some attitude measure is used to determine how a person 
might respond in a given situation, the possibility of disparity
between prediction and reality is a real one. Why does a student 
frequently say he will respond one way, only to follow a different 
script when the opportunity arises? A number of reasons can be 
brought forward, including the deliberate faking of responses to the 
measure for personal reasons, inability to follow a preferred plan
of action, changes in the attitude of the student between response
to the measure and response to the actual situation, new informa- 
tion which becomes available and causes the student to change his 
mind, and a change in his perceived situation as he responds to the 
actual situation. 

Keep in mind the distinction between the case of stable attitudes 
and the behaviors linked to them, and the case where a program 
attempts to change attitudes as well as the behaviors which should 
be linked to the changed attitudes. Here the relationship between 
expressed behavioral disposition and actual behavior is even more 
tenuous. To link the two, the student must internalize the informa- 
tion and must see how it relates specifically to him. In addition, the 
student cannot show the new behavioral disposition unless the 
opportunity arises. If the student has committed himself to a 
behavior pattern contrary to the goals of the attitude and behavior 
change program, then the probability of behavior change is sharply 
reduced, even if attitude appears to change. If the attitude change 
program is based on threats, it will have a stronger likelihood of 
changing expressed attitudes than actual behavior. Finally, the 
attitude measurement device might measure all but the key ele- 
ment, so the actual behavior may deviate from the predicted one. 

A Likert-type scale can be used to measure acquired behavioral 
dispositions in the classroom. A whole series of hints were provided 
to help you use this technique as part of your classroom measurement
arsenal. The Thurstone technique, involving equal-appearing
intervals, has also been briefly overviewed, as has a useful classroom
interest-assessing tool based on preferred occupational choice.

All measures of acquired behavioral dispositions are in need of a 
base line — a point of comparison to make interpretation meaning- 
ful. Sometimes this can be done with a set of norms; at other times 
a pre-test and a post-test design can be used with one group; and
in some cases a control group is appropriate. 

Finally, a technique for finding out how the students view one 
another was demonstrated. This device, called the sociogram, is 
easy for the classroom teacher to use and can provide very interest- 
ing and useful information for improving the social interactions in 
the classroom and helping to integrate isolated members back into 
the group. 



12 



On Evaluating a Project: 
Some Practical Suggestions 



Measurement and evaluation have similar but not identical mean- 
ings. Measurement implies a specific device, such as a test, an 
interview, or a performance task, designed for a specific purpose. 
Evaluation implies something larger. To evaluate one usually has to 
measure, but the measurement is not usually of just one aspect of 
a system. The measurement is of the entire system. 

The first eleven chapters have been devoted primarily to mea- 
surement topics. Of course, an evaluation scheme could be devel- 
oped for some purpose by selecting a series of the measurement 
topics. For example, to evaluate a learning program designed to 
promote some specific philosophy regarding sex education, one 
might measure using criterion-referenced tests, standardized tests, 
interviews, and unobtrusive observations. Each of these is a 
measure. When you put them all together, they make up an 
evaluation. 

Although the suggestions found in this chapter may be of passing 
interest to anyone who is or has ever been associated with a funded 
educational project, the target audience is those people who are 
associated with the evaluation of such a project. This primarily 
means the project administrators and project evaluators. 

You will not find a single formula in this chapter. There will be 
no symbols and only a few numbers. The suggestions included are 
not based on some sort of well-developed overall theory; they are 
based on the experiences the author has had evaluating a whole
variety of different kinds of projects. As you will be able to infer
from some of the later comments, not all of the experiences have
been pleasant.

The goals and objectives of funded projects vary across a broad 
range. Likewise, the techniques which the evaluator brings to bear 
must also be diverse. Evaluating a project is not a task just any 
bright and willing staff member can do. The evaluator, like the 
computer programmer, the accountant, and the engineer, must 
know the tricks of his trade. For some projects, these "tricks" are 
extremely complex, and the evaluator will need to have an extensive 
background. For others, the requirements are such that a regular 
staff person, responsible for other aspects of the project, could be 
trained to handle most of the evaluation efforts. 

This chapter begins with a series of practical and specific sug- 
gestions regarding the evaluation of a project. Following these 
suggestions, a particular general evaluation design will be outlined. 
Again, you will find this anything but a complicated and esoteric 
design. It is, in fact, simple, practical, and frequently very useful. 
The chapter ends with a description of some evaluation training
materials which can be used in project evaluation. 



Short Term/Long Term: 
The Impact of the Project 

The Assistant Superintendent of an elementary school district 
noticed two interesting things: First, some of the students were 
not making progress at learning to read as fast as he hoped they 
would; and second, some of the teachers weren't particularly good 
at diagnosing the specific problems these children were having. 
Being a good grantsman, he wrote a proposal to an agency and 
subsequently received funding for a project which was designed to 
solve both problems. 

Soon a project administrator was hired; testing people were 
brought in; and a reading specialist was assigned to each ten 
teachers. The reading and testing specialists were to help the 
teachers diagnose student problems. The Assistant Superintendent 
was happy; the school board and townspeople were impressed; the 
salaries of a number of people were augmented; and all seemed right 
in the world. 

Then the money ran out. 




The administrators moved on to other "soft money"; the con- 
sultants too searched for other projects; the teachers started mis- 
diagnosing or ignoring problems again; and the old reading problem 
made a dramatic reappearance. 

The project was a failure. Or was it? 

That depends on whether you're talking about short-term bene- 
fits or long-term changes. A good evaluator must keep these two 
separate. 

In the short term, a lot of students obviously benefited from the 
program. The teachers and district personnel probably learned 
from the procedures. Some interest, excitement, esprit de corps,
and activity were generated in the district. And the program ad- 
ministrators, staff, and consultants did all right, too. 

In the long run, the procedures died with the project funding, 
but maybe some positive things did occur. Maybe the funding 
agency learned that such an approach was not feasible without 
support external to the district. 

A first and very important decision by an evaluator is this one: 
Are these funds, time, and effort being expended to help or to 
change these students or these teachers or these administrators? 
Or are long-term changes envisioned — changes which will live long 
after the project funding ends? If the stated goal of the hypotheti- 
cal project described above was long-term change, it must be 
termed a failure. If the project outline only covered the particular 
student and teacher population which existed at the time of the 
funding, it did not fail — assuming each administrator, specialist in 
reading or testing, and consultant did, in fact, "do his thing" well. 

Most funded projects must be viewed as having some goals which 
are primarily of the long-term variety. These long-term goals may 
not be particularly explicit in the proposal, but they are frequently 
implicit in the relatively large amount of money spent on a small 
number of people. That is, when a district suddenly begins to spend 
50 percent more per pupil on a certain group of students for a three- 
year period — then turns the water off — it must be assumed they 
had something long term in mind. If not, the district probably 
should have spread the money equally among all of the pupils. 

How can you predict whether or not the project will have a 
long-term impact? Some warning lights for projects which probably 
will not have such impact can be offered. If a school district has a 
curriculum building project which does not closely involve its 
regular curriculum people; or a college has a project which is staffed 
primarily by "outsiders" on "soft money"; or if the "regulars" in
any agency are not closely involved with the day-to-day operations
of a project, the possibility of long-term changes is clearly limited. 
To find out if your project is headed toward the "short-term 
oblivion" route, do this little test. Periodically, say one day every 
month, make a checklist of all the things that happened in the 
project during that day. Decide which of these activities will con- 
tinue to occur when the funding runs out. If most of the activities 
are dependent on the outside money, you can predict what will 
occur when the money runs out. 
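
The little test can be kept as a simple tally. The sketch below is one possible way to record it. The activities listed are hypothetical examples loosely based on the reading project described earlier, and the function name is an assumption.

```python
def funding_dependence(activities):
    """Return the share of one day's activities that would stop without project money.

    `activities` maps a short description to True if the activity would
    continue after the funding ends, and False otherwise.
    """
    if not activities:
        raise ValueError("checklist is empty")
    stopping = [name for name, continues in activities.items() if not continues]
    return len(stopping) / len(activities), stopping


# Hypothetical checklist for one day of the project.
checklist = {
    "Reading specialist observes a class": False,
    "Testing consultant scores diagnostic tests": False,
    "Teacher diagnoses a reading problem unaided": True,
    "Project administrator files a progress report": False,
}
share, stopping = funding_dependence(checklist)
print(f"{share:.0%} of today's activities depend on outside money")
```

Repeating the tally each month and watching whether the share falls gives a rough picture of whether the regular staff are taking the procedures over.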



Concerning Car Salesmen: What, Specifically, 
Did the Project Writers Promise to Do? 

"This lil sweet 'arts 'bout the bes' '67 in town. Lo miles ... bin 
treat'd like a baby . . . bes' bargain in town at four-fifty." 

The car salesman wraps one single statement of fact in the same 
package with some half-truths, implications, and fast talk. The only 
fact was the price. If the car was not actually the best '67 in town 
or if it is not the best bargain in town, you won't really have any 
recourse later. The only thing he actually said he would do or 
guarantee was to sell the car for "four-fifty." 

Proposal writers are often like car salesmen. The evaluator has
to separate the actual promises (the things the project will do)
from the other implications. Usually, the project staff will only be
held accountable for things they specifically promised to do, just
as the salesman is only accountable for one fact in the statement
above.

Now, most proposal writers are not dishonest. The writer has to 
build a case to show the background conditions which lead to the 
need for additional funds. A competent proposal writer makes the 
best possible case for the proposed project. 

Sometimes it's hard to separate the "we will do this" statements 
from the "flag and motherhood" parts. Proposal writers are clever 
at mixing them up. It is possible to cut through the rhetoric, and 
go right to the heart of a proposal, by looking immediately at a 
few key places. 

First, check the budget section. Anything the project is spending 
money for — a director, a desk, a test — is a "will do" item. You can 
use the budget to make a "hierarchy of objective importance" 
schedule, but this discussion comes later. Next, look for a timetable 
of the project. Most funding agencies now require that proposals
list the specific dates for completion of the various sections of the
project. Any item listed here is also on the "will do" list. Finally, 
look for a section in the proposal which describes the activities of 
certain personnel to be hired. Any activity listed here is part of the 
"will do" category. 

If it seems as if the proposal is taking on some important task, 
but there is no budget set aside, no target dates promised, and no 
personnel promising to work on it, reread the proposal. That item 
is in the "bes' lil car in town" class! 



Evaluating Objectives: Two Interpretations 

Criterion-referenced tests, mastery learning, performance con- 
tracting, learning packages, behavioral objectives — these currently 
popular terms are all closely related to the notion of stating objec- 
tives specifically, a topic considered in chapter 2. Assuming the 
objectives of the project are stated specifically, the evaluator has 
the role of "evaluating the objectives." People interpret this role in 
two ways. It is important to make sure that the project people and 
the evaluator are both saying the same thing when asking for an 
"evaluation of objectives." 

To most evaluators, "evaluation of objectives" means "evaluat- 
ing to see if the objectives have been attained." This implies 
measures which are most suited to determining if an objective has 
been reached by the people toward whom the project was geared. 
The whole range of techniques, mastery tests, questionnaires, and 
interviews, can be brought to bear on the objective. 

However, some people interpret "evaluating the objective" to 
mean comparing the objective to other possible objectives. That is, 
they see this as requiring a value judgment of the objective, com- 
pared to others. 

For example, take this objective: "The student shall recall the 
equivalencies between the common metric and English units of 
time, length, volume, and weight." To evaluate the attainment of 
the objective, one would devise some sort of an achievement test 
asking the student to recall all, or a random sample of, these 
equivalencies. 

On the other hand, to place a value judgment on the objective 
would require asking questions like: Why should a student recall 
these equivalencies? Perhaps the student should simply recognize 



On Evaluating a Project: Some Practical Suggestions 291 

them, use them in context. Maybe estimating lengths, masses, and 
volumes is the proper manner in which students should "know
about" the relation between English and metric units. Even 
broader, why is this information important at all? Valuable school 
time will be taken if the student is to reach this objective. Couldn't 
that time be better spent elsewhere, for example, in reading a news- 
paper or socializing with his peers? 

Placing a value judgment on an objective is obviously a difficult 
problem, but still very important. These decisions regarding value 
of objectives should be made by the people served by the agency 
housing the project. If the agency is a school, the constituency is 
the taxpayers, parents, teachers, and students. If the agency is a 
university, it has a similar constituency — the people it serves, the 
people who support it. A public service agency, such as the Parole 
Board, is designed to serve a certain group of people, and is sup- 
ported by another group. Both should be involved in making value 
judgments about an objective. 



The Budget: Sections Which Cost More 
Should Produce More 

Think back on the hypothetical situation outlined at the beginning 
of the chapter where the Assistant Superintendent tried to solve a 
reading problem. Suppose the budget looked like this: 

Administration          $ 20,000
Reading Specialists       84,000
Test Specialist           10,000
In-Service Training        1,000
Secretarial                6,000
Consultants                5,000
                        --------
                        $126,000

Now the evaluator prepares a score sheet: 

                       GOOD   INDIFFERENT   BAD
Administration           X
Reading Specialists                          X
Test Specialist          X
In-Service Training      X
Secretarial              X
Consultants              X




That's 5 to 1— a good project. Right? Wrong! That's 84 to 42, 
or 2 to 1 — a bad project. That is, $84,000 was spent on a section 
which didn't work out, while $42,000 was spent on the five sections
rated "good." 
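
The contrast between counting sections and counting dollars can be checked with a few lines of Python. This is only a sketch: the budget figures are those given above, the ratings are the ones the text implies (Reading Specialists "bad", the other five "good"), and the function name `tally` is an illustrative choice, not anything from the original.

```python
# Budget line items from the example above, in dollars.
budget = {
    "Administration": 20_000,
    "Reading Specialists": 84_000,
    "Test Specialist": 10_000,
    "In-Service Training": 1_000,
    "Secretarial": 6_000,
    "Consultants": 5_000,
}

# Ratings implied by the text: five sections "good", one "bad".
ratings = {
    "Administration": "good",
    "Reading Specialists": "bad",
    "Test Specialist": "good",
    "In-Service Training": "good",
    "Secretarial": "good",
    "Consultants": "good",
}


def tally(budget, ratings):
    """Return (number of sections per rating, dollars per rating)."""
    counts, dollars = {}, {}
    for section, rating in ratings.items():
        counts[rating] = counts.get(rating, 0) + 1
        dollars[rating] = dollars.get(rating, 0) + budget[section]
    return counts, dollars


counts, dollars = tally(budget, ratings)
print("sections by rating:", counts)
print("dollars by rating: ", dollars)
```

Counting sections favors the project; counting dollars does not, which is the point of the example.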

The moral of this story for the evaluator leads to this series of 
steps: 

a. List all of the "will do" statements. You may want to do some 
combining of these as long as the combined statements define a 
fairly discrete and homogeneous set of activities. 

b. Now go through the budget and, as much as possible, estimate 
the amount of each line item which should be allocated to each of 
the activities. For example, the "Materials and Supplies" and
"Travel" sections can be allocated to those activities which their 
inclusion in the budget was designed to enhance. 

c. Some of the money probably cannot be allocated. A project 
secretary, for example, spends time on all phases. Don't try to
apportion these line items — just ignore them. 

d. Now determine the percentage of total funds for each activity 
or group of activities. Make a circle chart which shows visually the 
proportion of your monetary resources which are devoted to each 
area. 
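
Steps a through d amount to a small bookkeeping exercise, sketched below in Python. Everything in the sketch is hypothetical: the three activity names and the way each line item is split among them are invented for illustration and do not come from the proposal discussed here. Line items that serve every phase, such as the project secretary, are simply left out, as step c suggests.

```python
# Hypothetical allocation of budget line items to "will do" activities.
# Neither the activity names nor the dollar splits come from the text.
allocations = {
    "Reading Specialists": {"remedial instruction": 70_000, "diagnosis": 14_000},
    "Test Specialist": {"diagnosis": 10_000},
    "In-Service Training": {"teacher training": 1_000},
    "Consultants": {"teacher training": 3_000, "diagnosis": 2_000},
}


def activity_shares(allocations):
    """Return each activity's percentage of the allocated funds."""
    totals = {}
    for split in allocations.values():
        for activity, amount in split.items():
            totals[activity] = totals.get(activity, 0) + amount
    grand_total = sum(totals.values())
    return {activity: 100 * amount / grand_total
            for activity, amount in totals.items()}


shares = activity_shares(allocations)
for activity, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{activity:22s} {pct:5.1f}%")
```

The resulting percentages are the slices of the circle chart described in step d.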

Of course, the project staff is interested in seeing that all of the 
goals and objectives of the project are fulfilled. Sometimes, though, 
priorities for extra effort must be established. If the major em- 
phases of the project — major at least in terms of monetary com- 
mitment — are made explicit, these extra efforts are more likely to 
be directed where the payoff will be most desirable. 



The Project Staff: Concerning Prior Commitments 
and Real Power 

First fable: The Assistant Superintendent wrote the proposal 
covering the alleged reading and diagnosing problem in his district. 
The proposal was funded. His Superintendent thought it might be 
nice for Mr. Assistant Superintendent to direct the project, so the 
School Board granted him a three-year leave of absence from his 
regular job and named him project director. 

Mr. Former Assistant Superintendent starts work as project 
director. But one day a few weeks later the Superintendent is faced 
with a problem that he knows the Former Assistant Superintendent 
used to handle beautifully. So he asks for a small favor "just this 




one time." These favors probably will continue, and soon the 
"project director" is working far less than full time on the project. 

The moral of the fable: If someone on the project staff was with 
the same organization prior to appointment to the project, look 
very carefully. Organizations (school districts, public agencies,
universities, etc.) have a tendency to appoint a person from within
for a new job without appointing a different person to the old job. 
Since the old job often goes unfilled, the new project person carries 
many of the old responsibilities with him. If a new Assistant
Superintendent is not appointed to handle the responsibilities of 
the former Assistant, you can rest assured he is not committed full 
time to the project. 

Final Fable: As project director, the Assistant Superintendent 
recruits two teachers from each building to work with him on the 
project. The project picks up most of the teachers' salaries. He 
directs these teachers to begin working on materials for six- and 
seven-year-old children. One week later he checks again with the 
teachers and finds two disturbing notes: First, they had not com- 
pleted nearly as much as he expected; and second, they were also 
developing materials for pre-schoolers and eight-year-olds. In addi- 
tion, they seemed inquisitive to his urgings to comply with his 
original requests. He found out, after some searching, that while 
on paper the teachers were paid by and responsible to the project, 
in practice anything that happened in a particular building was the 
responsibility of the building Principal. Even if the teachers had 
not been holdover teachers from prior years, the lines of authority 
would have been blurred by the traditions of the district. By the 
way, the tradition of the building Principal being responsible for 
the happenings in his school is a widespread and honored one. 

The moral of this fable: You can tell who has the "paper power" 
simply by reading the proposal. Be very sensitive to the other issue, 
however, of "real" power. Who is always consulted when major 
decisions are made? Who is able to counter or change directions 
given by the project personnel? 

If the lines of authority are not clear, the efforts of project 
personnel can be seriously diminished. If the teachers in the above 
example do not know which person to respond to (the project 
director or the building Principal), they will probably not do a 
satisfactory job of either set of directions. Blurred lines of author- 
ity lead to all kinds of intrigue and inefficiency. The implication 
should not be drawn that the project staff be given absolute and 
unchallenged authority over that which happens under the project's 




auspices. Obviously, these activities will affect the agency sponsor- 
ing the project and the agency needs to have a hand in the decision 
making. But the project staff cannot abdicate all decision-making 
responsibility in favor of the personnel in the sponsoring agency. A 
balance must be struck and the project evaluator must find out and 
make explicit (to all if possible, but at least to himself) just what
the specific lines of "real" authority are. 

Timetable: If a Child Is to Be Born in October,
Conception Must Occur Somewhat Earlier 

"By June 1, 1973, the project staff will have received the approval 
of the School Board to pilot test a series of special reading programs 
in certain schools in the district." 

An important phase of the project evaluator's work involves 
working out timetables. Think of the very simple outcome de- 
scribed above. A project staff which figures, "We'll get the programs 
written up and to the Board when we're ready," will either not 
reach the objective or will be doing frantic and inefficient last 
minute scrambling to make the deadline. Here is the kind of task 
analysis planning which should go on: 

a. If the deadline is June 1, what is the last regular meeting of 
the School Board before June 1? Suppose the answer is May 20. 

b. How long prior to a meeting must an item involving approval 
of a new program be submitted to the School Board? Tradition 
suggests one month. Thus, the completed document must be in the 
hands of the members by April 20. 

c. Traditionally, how long does it take for a completed draft to 
make it through typing, duplicating, collating, and mailing? Sup- 
pose, in this district, the time is two weeks. The date is now back 
to April 6.

d. How much time should reasonably be set aside for the prepa- 
ration and planning of the proposed program, including obtaining 
background information, allowing time for writing of suggestions, 
scheduling meetings, and so forth? Suppose this requires two 
months. The date is now February 6. 

e. Assume the staff and committee members must be recruited. 
How long does it take to carry out these tasks? A good person 
usually does not get interviewed one day and change jobs the next. 
A person needs time to arrange his schedules and fulfill prior com- 
mitments. Figure three months from the beginning of recruitment 




to the time when people will be working on the task. Recruitment 
should begin by November 6. 

To be sure, some of the estimates you make will be flexible. 
Shortcuts could be taken at certain points. But if the project staff 
wants this job done by June 1, they should have the item on the 
agenda on May 20, the information out by April 20, the draft done 
on April 6, and so forth, all the way down the line. If the project
staff doesn't even start moving seriously on an item until long after 
these dates, the possibility of nonattainment of the goal, or of 
substandard attainment, is very high. A very important responsi- 
bility of the evaluator is to make sure these final outcomes have 
been task-analyzed to determine the necessary antecedent activities 
and to assign estimated dates to the completion of each. He then 
should help the project staff conform to this estimated timetable. 
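
The backward planning in steps a through e can be reproduced with ordinary date arithmetic. The sketch below assumes the May 20, 1973 Board meeting and the lead times suggested in the text; the helper `months_before` and the wording of the step labels are illustrative additions, and the helper assumes the same day of the month exists in the earlier month.

```python
from datetime import date, timedelta


def months_before(d, months):
    """Same day of the month, `months` earlier (the day must exist there)."""
    index = d.year * 12 + (d.month - 1) - months
    return date(index // 12, index % 12 + 1, d.day)


# Last regular Board meeting before the June 1 deadline (step a).
board_meeting = date(1973, 5, 20)

# Antecedent activities, listed from the deadline backward (steps b to e).
lead_times = [
    ("document in the hands of Board members", "months", 1),
    ("completed draft ready for typing", "weeks", 2),
    ("preparation and planning begin", "months", 2),
    ("recruitment begins", "months", 3),
]

timetable = []
current = board_meeting
for label, unit, amount in lead_times:
    if unit == "months":
        current = months_before(current, amount)
    else:
        current = current - timedelta(weeks=amount)
    timetable.append((label, current))

for label, when in timetable:
    print(f"{when:%B %d, %Y}  {label}")
```

Changing any single lead time shifts every earlier date, which is why the estimates are worth writing down before the project starts.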



Individualized Instruction Requires 
Individualized Reporting 

Recently, the author was called into a federally funded project 
designed to improve reading instruction in a group of elementary 
school districts. The core of the project was individualized diagnosis 
by the teacher and individualized learning by the student. However, 
the reporting system which the project staff had devised was built 
around such things as means, medians, standard deviations, and 
group change scores. 

Doesn't that seem a little silly? If the project is to concentrate 
on individual growth shouldn't the reporting system also concen- 
trate on individual growth? A report from such a project should 
include a separate statement about each individual. These statements
should report (1) whether the child changed by the expected amount and
(2) if the expected change did not occur, an explanation for the
failure. If the people on the project say they haven't
enough information to make such a report, then the project isn't 
really an individualized instruction project! 

How does one report individualized growth? If the students are 
young and the area is one covered by an achievement battery, a 
predetermined growth in terms of grade equivalents might be used. 
Otherwise, some preset number of standard error of measurement 
units could be required. Both of these concepts — grade equivalents 
and standard error of measurement — are covered in chapter 10. 
The expected growth could be expressed in a variety of ways. The 




only requirements should be that the amount be reasonable to 
expect and acceptable to those in charge of bringing about the 
growth. 
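
As one possible illustration of an individual statement, the Python sketch below checks each child's gain against a preset number of standard error of measurement units. The scores, the standard error of 3 points, the requirement of two units, and the function name are all assumptions made for the example; the text prescribes none of them.

```python
def individual_report(name, pre, post, sem, required_sem_units=2.0):
    """Return a one-line statement about one child's growth.

    `sem` is the test's standard error of measurement; the expected gain
    is `required_sem_units` times that value (an assumed rule).
    """
    gain = post - pre
    expected = required_sem_units * sem
    if gain >= expected:
        return f"{name}: gained {gain} points; expected growth of {expected:g} met."
    return (f"{name}: gained {gain} points; expected growth of {expected:g} "
            "not met. Explanation required.")


# Hypothetical pre-test and post-test scores for three children.
scores = {"Child A": (40, 47), "Child B": (35, 38), "Child C": (52, 58)}
for name, (pre, post) in scores.items():
    print(individual_report(name, pre, post, sem=3))
```

A report built this way still needs the second element the text calls for: someone must supply the explanation wherever the expected change did not occur.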

Individualization often takes a slightly different form when a 
project is concerned with people beyond the middle school level. 
Up to the middle school level the goals of instruction are much 
the same for all students. Individualization comes in the methods
used in reaching the goals. Later on, however, individualization 
implies meeting different types of goals for the individuals. 

For example, the author is familiar with a project designed to 
prepare teachers, school administrators, and college faculty to deal 
effectively with the education of students who are not part of the 
"middle-class" group. The project required the interns to spend 
time in ghetto schools, social agencies, and university courses. The 
program was supposed to be "individualized." Individualized, in a 
project like this one, meant that the student should be able to find 
ways to fill his personal needs. An intern with five years of experi- 
ence in a ghetto school should not be required to have another year 
of experience here, but instead be able to find an effective way to 
use that time. 

To evaluate the success of such a program, group reporting sys- 
tems would not be satisfactory. For this project, a technique was 
devised wherein the interns were asked what their expectations 
were before the project started. A whole list of possible activities
was presented and the interns responded by giving high and low
priorities. Halfway through the year the same list was presented and
the students were asked which of the activities they were actually 
participating in. At the end of the year, the statements were re- 
phrased in the past tense. Each person was asked which activities 
had actually been part of his or her learning program. 

The success of the project's individualized part, then, was measured
by the congruence between what each individual expected
and wanted and what actually occurred in his or her program. Any 
group sort of report might just list the number who were able to
fill individual needs and those who were not. 



Continuous Assessment: A General Evaluation Model 

My wife is a wonderful cook, but I can't stand to watch her. She'll 
make a beautiful roast, put the necessary spices and so forth on it,






shove it in the oven, set the timer and leave it until the bell rings. 
She never opens the oven door while the meat is cooking. When it's 
done, she looks at the product and says, "That looks fine," or 
"Maybe I should have cooked it longer," or "This is too well done." 
A lot of project evaluators work that way too. They do some 
things at the beginning, go away until the end, then look at the 
product and make certain evaluative statements. This is called 
"outcome evaluation." "Intervention evaluation" seems better — 
that is, looking at the product while it is cooking so that midstream 
changes can be made which will raise the possibility of a satisfactory
product. One type of intervention evaluation utilizes continuous
assessment. The two graphs of figure 12 illustrate the difference.¹



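[Figure 12. Two graphs, OUTCOME EVALUATION and CONTINUOUS ASSESSMENT,
each plotting performance level (Low to High) against time (Sept. to
May). The outcome evaluation graph carries marks at the start and end
of the year only; the continuous assessment graph carries marks at
intervals throughout the year.]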

The outcome evaluation provides a mark at two points on the 
graph. To be sure, it informs the director of initial and final per- 
formance levels. Continuous assessment also provides this informa- 
tion, and in addition, tells the project the path through which the 
participants passed in reaching the final performance level. As the 
second graph shows, it may be important to know that most of
the gains shown on the graph had occurred by the time the school
year was about two-thirds over.

1. This discussion of a general evaluation model is presented in more
detail in chapter 3 of John W. Wick and Donald L. Beggs, Evaluation for
Decision-Making in the Schools (Boston: Houghton Mifflin, 1971).

Continuous Assessment is based on these five premises: 



1. Group Statements and Individual Statements: Which? 

The evaluation and reporting for some projects revolves around the 
individuals in the program. A project to train teachers of the deaf, 
for example, may emphasize the progress made by each teacher in 
the program. A clinical program focusing on correction of a speech 
problem would probably also require that each and every child who 
passes through the clinic be a part of the testing program. Report- 
ing on each individual would be required. Group averages could also 
be reported in each case, but as part of the program, each partici- 
pant would be part of the evaluation efforts. 

Sometimes projects work with a sample of a population, with the 
intention of inferring that the changes which take place in the 
sample would also have happened to the others in the population — 
if they had been part of the project. In this sort of situation, it 
doesn't really matter who is selected or who is tested, as long as 
those who are measured provide an accurate representation of the 
larger group. Here it would not be mandatory that each and every 
individual be part of the evaluation program. The only mandatory 
thing would be that enough take part so that the results are 
reliable. 

The first decision, then, is the group/individual decision. Is your 
project going to report the results in terms of each and every 
individual involved, separately, and not simply in a summary form? 
If so, each person must be part of each evaluation step. On the 
other hand, if your major reporting emphasis is on composite or 
group results, with actual or implied inferences to a larger population,
it is not absolutely essential that every participant take part
in every phase of the evaluation. Our experience with project evalu- 
ations has been that the primary reporting for most projects is in 
the group sense, except in cases described in the previous section. 

2. Fit Your Sample Size for Each Measure to the Accuracy 
You Require 

Consider this example: A certain standardized measure, say an 
achievement test which accurately measures the major cognitive 




objectives of your program, has a standard deviation of 10, which 
is about right for a 60 to 75 item test. Assume further that one 
thousand people are participating in your program, and you want 
to measure the change that takes place with these people. 

If you test all one thousand at one time, with the standard 
deviation of 10, a change of about two-thirds of a point will be a 
statistically significant change. If you test half of the thousand, a 
change of just under one point would be statistically significant.
If you lower that to just one hundred tested, the change
required goes up to about two points for significance. As you lower 
the sample size, the required amount of change for statistical 
significance increases. 
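
The three figures quoted above can be approximated with a short calculation. The sketch assumes a two-sided test at the 5 percent level (z = 1.96) applied to the standard error of a mean, SD divided by the square root of the sample size; the text does not state which test or level was used, so this is only one reading that happens to reproduce its numbers. The function name is illustrative.

```python
import math


def smallest_significant_change(sd, n, z=1.96):
    """Smallest mean change that reaches significance for a sample of n.

    Assumes the change is judged against z standard errors of the mean,
    where the standard error is sd / sqrt(n).
    """
    return z * sd / math.sqrt(n)


for n in (1000, 500, 100):
    change = smallest_significant_change(sd=10, n=n)
    print(f"n = {n:4d}: change of about {change:.2f} points")
```

Because the sample size enters through its square root, cutting the sample to a tenth only roughly triples the change required.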

So you should use the largest sample size possible so a very 
small change will be statistically significant. Right? No! First, 
decide how much precision you need, then make your sample size 
fit. To use a larger sample than you need is wasteful. 

This is an old example, but still appropriate. If the Principal of 
a school finds that 84 percent of the students actively support 
burning the building to the ground, it really doesn't matter if the 
99 percent confidence interval is from 74 percent to 94 percent or 
from 83.9 percent to 84.1 percent. As long as he is 99 percent 
confident that the actual figure is very high — higher than 50 per- 
cent or higher than 20 percent for that matter — he has enough 
information to cause him to take immediate action. If a sample of 
20 was large enough to establish the 99 percent confidence interval 
between 74 percent and 94 percent, then 20 was enough. Any 
further sampling would be a waste of time and effort. 

In a particular project, to get the sample properly sized, two 
decisions must be made. The project staff makes the first by 
answering this question: How large must the change in the population
of participants be before we will consider it a meaningful change in 
the practical sense? Then the statistician must work from that 
answer to determine how large the sample must be so that if a 
change as large as that specified does really happen, the sample 
will be large enough to be sensitive to it. 
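
The statistician's half of the job can be sketched by turning the same relationship around: given the smallest change the staff considers meaningful, solve for the sample size. As before, the 5 percent two-sided criterion (z = 1.96) is an assumption rather than something the text specifies, and a fuller treatment would also consider statistical power. The function name is hypothetical.

```python
import math


def required_sample_size(sd, meaningful_change, z=1.96):
    """Smallest n for which `meaningful_change` is at least z standard errors.

    Solves z * sd / sqrt(n) <= meaningful_change for n and rounds up.
    """
    return math.ceil((z * sd / meaningful_change) ** 2)


# With a standard deviation of 10, as in the example above:
for change in (2, 1):
    n = required_sample_size(sd=10, meaningful_change=change)
    print(f"meaningful change of {change} point(s): test about {n} people")
```

Halving the change that must be detected roughly quadruples the sample, so the staff's answer to the first question largely determines the cost of testing.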

3. If the Proper Sample Size Does Not Utilize All of the 
Participants at One Setting, Space the Testing Out 
Over a Series of Time Intervals 

Consider again the project with one thousand participants. Suppose 
a testing of one hundred was satisfactorily precise. The project 




director could then divide the group into ten random samples. If 
each sample were tested approximately six weeks apart, the prog- 
ress could be recorded at six-week intervals over the course of the 
year without any additional cost above testing all one thousand at 
one time. Such a design would undoubtedly offer far more in- 
formation. 
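
A rotating design of this kind is easy to set up. The Python sketch below shuffles the participant list once and deals it into ten groups, each assigned a testing week six weeks after the previous one. The participant numbers, the fixed random seed, and the function name are assumptions made for illustration.

```python
import random


def rotating_samples(participant_ids, n_groups, weeks_apart, seed=None):
    """Split participants into random groups tested at spaced intervals.

    Returns a list of (week, members) pairs. Each participant appears in
    exactly one group, so the total testing load equals one full testing.
    """
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)
    size = len(ids) // n_groups
    schedule = []
    for g in range(n_groups):
        start = g * size
        # The last group absorbs any remainder.
        end = start + size if g < n_groups - 1 else len(ids)
        schedule.append((g * weeks_apart, ids[start:end]))
    return schedule


schedule = rotating_samples(range(1, 1001), n_groups=10, weeks_apart=6, seed=1973)
for week, members in schedule:
    print(f"week {week:2d}: test {len(members)} participants")
```

Because the groups are drawn at random, each one stands in for the whole thousand at its own point in time.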

Now, don't get the mistaken impression that samples numbering 
into the thousands are needed to carry out this design. A number 
of times, in evaluating various projects, a team including the author
has interviewed project participants. In a good working morning, 
an interviewer talks with about five participants — give or take a 
couple. It is always amazing how accurately the interviewers can 
predict all future comments based on the comments of the first few. 
A random sample of five in this case is enough for fairly accurate 
predictions of the general tone which permeates the project, the 
attitude of the group toward its leadership, the amount of commit- 
ment on the part of the participants, the degree to which the 
average person is reaching stated goals, and so forth. To be sure, 
these perceptions become sharper and more filled with anecdotal 
evidence later — but basically a very small sample is satisfactory 
for general reasons. This implies, of course, that it is a random 
sample and not a bunch of "ringers" picked by the project director! 

4. Try to Get the Spacing Between Testings Far Enough Apart 
So That Meaningful Changes Have Time to Occur 

This sounds like talking out of both sides of the mouth — first 
you're urged to evaluate continuously, then to get the spacings far 
enough apart. If you space your measures too far apart, you miss 
the path through which the participants passed in reaching their 
final performance level. If you place them too closely together the 
changes may be too small to be measurable. The results will be 
misleading. 

What is "long enough"? That depends on the complexity of the 
change you are seeking. A change in performance level on some task 
— knowing the answers to a series of lower-level cognitive ques- 
tions, administering an individualized intelligence test, or being 
able to properly reshelve books in a library — can be expected 
quickly, as long as practice time is available. For each task above, 
a spacing of one week would not be too small. 

Suppose you wanted to change the attitude of a group of experi- 
enced elementary teachers about the frequency of referrals of 




"problem" children to special classrooms. That is, change the 
teacher's tendency from "Get that kid out, anywhere!" to a
serious attempt to solve the problem in the classroom setting. You 
could not expect this change to happen quickly. This type of be- 
havior change program would require a considerable amount of 
information, anecdotes, examples, practice, and evangelism before
long-held responses would be overcome. One-week measurement 
spacings would probably be too small. It seems more realistic to 
space measures of this type of attitude change at four- to six-week 
intervals. 

Most real situations probably fit somewhere between the two 
examples. That is, a spacing of from one to six weeks between 
testings is usually appropriate. 

5. Find the Aspects of the Project Which Are Appropriate to a 
Continuous Assessment Design and Implement Such a Program 

Which aspects are appropriate? Look at each aspect of the project 
wherein change is possible or expected. If continuous information 
would keep the project focused on its objectives, and raise the 
possibility of reaching them, then continuous information should 
be gathered. Of course, the four preceding premises provide re- 
straints. Perhaps you do not have enough participants for the 
preferred spacing of testing; perhaps you demand such a level of 
precision that all participants must be used at one time; or perhaps 
reporting must be individualized so that a sampling program is 
inappropriate. Even given these restraints, you will probably find 
that a large proportion of the objectives of the project can be 
appropriately designed into a continuous assessment program. 



Some Other Roles of an Evaluator: Keeping Communication
Lines Open Among All Groups in the Project

Every project reaches a variety of groups. The groups usually have 
their own interests, and have probably learned to communicate 
with each other in well-established ways. For example, the program 
with the Assistant Superintendent described earlier will involve 
students, teachers, administrators, some reading specialists, probably
a few university professors, and maybe even some graduate




students. Teachers and administrators usually have an employee- 
employer sort of relationship. Two-way communication, even 
though badly needed by the project, will be difficult to establish 
where prior tradition is strong. Building administrators have es- 
tablished communication channels with central office administra- 
tors; the manners in which reading specialists communicate with 
university professors may be fixed, and so forth. To be successful, 
the project may require a level of communication among groups 
which is not likely to occur without some sort of externally imposed 
"greasing of the communication skids." 

The evaluator is in a good position to be a communication facili- 
tator. This is especially true if the evaluator is external — that is, 
not a regular or previous member of the agency housing the project. 
If the evaluator is external to the agency funding the project, he 
will be relatively free from intimidation or coercion from any group 
in the project or in the agency. From this perspective it will be 
possible for him to establish communication between different
groups by insuring anonymity to all who respond to his questions.
Without anonymity most people hesitate to make comments about 
persons higher up in the organization, especially if the comments 
might be construed as being critical. If the evaluator can insure 
anonymity and establish his own credibility, this hesitancy will 
evaporate and two-way communication will be possible. 

The "Action-Now" Philosophy 

Most graduate training programs in education now require at least 
some work in the area of educational statistics. The number re- 
quiring a course in testing is probably much smaller. However, very 
few offer courses which are basically "evaluation" courses. Thus, it 
follows that most evaluators have been trained in the areas of 
statistics and research design with some emphasis on testing. This 
type of formal training often leads to two evaluation modes for 
projects, namely (a) a pre-test/post-test design; and (b) a heavy 
emphasis on an achievement testing technique. While such tech- 
niques may be appropriate to some projects, they are not appropri- 
ate universally. 

All projects have certain goals and objectives at the outset. The 
project staff has ideas, some vague, some very specific, about the 
manner in which these goals and objectives can be reached. 
The evaluator can do one of two things: Baseline data can be 
gathered at the start of the project and then repeated near the end 




of the project to see if the goals were met; or the evaluator can be 
involved at every step, attempting to determine if procedures 
should be altered at some point to raise the probability of fulfilling
the original goals and objectives. These intervention evaluations 
should involve the whole range of measurement techniques: inter- 
views, scales, questionnaires, verbal communications, content anal- 
ysis, observations, and others. Any technique which provides 
evaluative information to the project staff is appropriate. 

The above points were established earlier, during the discussion 
of the continuous assessment evaluation design. The "action-now" 
philosophy is broader than any particular design, however. Regard- 
less of the design, the "action-now" type of evaluator will be 
involved at all stages of the project, helping the project staff in its 
efforts to meet the project objectives. This calls for continuous 
data collection, many different measurement techniques, quick 
feedback of results to project staff, and interpretation of the results 
in language which is meaningful to them. 



What Do the Numbers Mean? 

The very last part of the sentence above needs a few additional 
comments. The evaluation results aren't of any value at all if the 
people who make the decisions in the project cannot interpret 
them. Many project administrators are quite uncomfortable with 
numerical results of any kind. The evaluator must provide the 
project staff with more than just the results. He must also provide 
information on probable implications, points of inaccuracy, infor- 
mation which may be unreliable or biased, and suggestions for fur- 
ther data gathering. Evaluation reports which are designed to help 
the project decision makers should not be written as scholarly 
articles for peers in the evaluation business. They must be under- 
standable by those who need the results. The test of the evaluation 
is not the technical beauty of the measurement devices. The test is 
the usefulness of the results. 



Training Internal Project Evaluators 

Every project needs some sort of evaluation. No project is begun 
without goals, and the only way to find out if the goals have been 
met is to evaluate the process and the results of the project. 




But not all of the evaluation tasks are so complex that an evalua- 
tion specialist need be called in. With a minimal amount of training, 
someone associated with a project in another role can be trained to 
handle a substantial proportion of the evaluation tasks. 

Training programs wherein staff members can "tool up" for 
internal evaluation tasks exist around the country. An example can 
be found at Northwestern University.² The program is constructed
to be individualized and self-instructional, built around a slide/ 
tape/programmed approach. Topics include sampling techniques, 
content analysis, historical research, developmental studies, experi- 
ments, common errors in evaluation and research, questionnaire 
construction, standardized tests, interviews, data presentation, 
computer applications, and specifying behavioral objectives. Each 
topic requires about two or three hours of interaction with the 
programmed materials. 

2. For further information, contact the Center for the Teaching Profes- 
sions, Northwestern University, Evanston, Illinois 60201. 



Appendix: Use of Normal Distribution Probabilities



One of the most useful models in the world of educational statistics and 
testing is that of a normal distribution. "Model" as used here implies a 
simple, although comprehensive way of interpreting test scores. Figure 
A-1 shows the well-known shape of a normal distribution. 



Figure A-1 
The Shape of a Normal Distribution 

(Horizontal axis: some sort of scores or measures, arrayed from the 
low end to the high end. Vertical axis: the frequency with which each 
score occurs.) 




The normal distribution is defined by a complicated mathematical 
formula. The usefulness of the normal curve is not a function of the 
complicated nature of the formula, however; many complicated formulas 
exist in mathematics. The usefulness comes from the large number of 
real-life score distributions which take on a shape approximately 
like that of a normal distribution. Distributions of test scores, means of 
groups, heights of males, and number of random errors made in test 
situations are all approximated by a normal distribution. 

From figure A-1, you can see that the scores of some test (or the 
heights of males, mean scores for groups, or whatever is being tabulated) 
are arrayed along the horizontal axis. The frequency with which each 
of these scores occurs is on the vertical axis. Thus, you can see that the 
normal distribution curve, being high in the middle and flaring out at 
both ends, does approximate many naturally occurring score distributions. 
Consider, for example, the distribution of heights for adult males. While 
basketball teams have their seven-footers and the author includes some 
very short men among his acquaintances, most men are around the 68 
to 70 inch range. Scores from a test like the American College Test 
(ACT) are roughly normally distributed, since most students tend to 
score toward the middle, and a much smaller percentage of prospective 
college students score at either extreme. 

The concept of a standard error of measurement has been presented 
in the text. The introduction to the use of normal distribution 
probabilities will be tied to examples based on the standard error of 
measurement. This concept is based on these assumptions: 

1. The items in a test are merely a sample from a whole domain of 
items which could have been written about that particular content. 

2. A person's test score is also just a sample of that person's perform- 
ance on the domain of content. 

3. This implies that if a different sample of items had been selected, 
the person probably would have a different score. In fact, the implica- 
tion is that a whole series of tests could have been constructed, and a 
whole series of scores could have been accumulated. 

4. The average of this whole series of scores which the person could 
conceivably obtain from the series of tests is viewed as this person's 
real performance level. 

5. So, when the person takes only one single test (and not a whole 
series) it is interesting to speculate how close this test score is to the 
real performance level for this person. And this is the place where the 
normal distribution comes in. 

6. The assumption is made that the single observed test score will 
usually be close to the real performance level, but occasionally, the dif- 
ference between the two will be substantial. Based on the assumption of 
a normal distribution centered on the person's one observed test score, 
the picture looks like the distribution in figure A-2. 







Figure A-2 
A Normal Distribution Showing S.E.M. Units Arrayed Around Observed Score 

(The middle of the distribution is the observed score. The numbers on 
the horizontal axis refer to the number of standard error of 
measurement units above and below the observed score.) 



If you cannot recall how to compute the standard error of measure- 
ment, review this presentation in chapter 9. 

Now, based on the assumed normal distribution, probability state- 
ments can be made regarding the location of the person's real perform- 
ance level. Table A-1 shows some of the probabilities associated with a 
normal distribution. 

For example, take the case of a girl who attains a score of 120 on a 
test. The computed standard error of measurement is 6.0, based on a 
standard deviation of about 16 and a reliability slightly above 0.86. All 
of the figures roughly approximate what might be expected from a com- 
monly known IQ test. The normal distribution diagram for this girl 
looks like figure A-3. 
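
The arithmetic behind that standard error can be checked with a few lines of Python. This is only a sketch: it assumes the chapter 9 formula (S.E.M. equals the standard deviation times the square root of 1.00 minus the reliability), and the function name is an illustrative choice, not something from the text.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Return S.D. * sqrt(1 - reliability), the S.E.M. formula assumed here."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Figures from the girl's example: S.D. of about 16, reliability near 0.86.
sem = standard_error_of_measurement(16, 0.86)
print(round(sem, 1))
```

With these inputs the result rounds to the 6.0 used in the example.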

The girl scored 120. What is the probability that her real performance 
level (based on multiple testing using different samples of items 
for each testing) is in excess of 126? The diagram shows two scoring 
scales. The upper one shows the number of S.E.M. (standard error of 
measurement) units above and below her score. The lower scale 
translates the upper scale to actual score units. The S.E.M. equals 6.0, you 
recall, so one S.E.M. above the score of 120 is (120 + 6) or 126. 

What is the probability of her real performance level being in excess 
of 126? Look at table A-1. The table can be interpreted easily if a few 
things are kept in mind. Look at figure A-4. 



Table A-1 
(Normal distribution probabilities: the proportion of area between the 
middle and a point a given number of S.E.M. units above or below the 
middle, for points from 0.2 to 3.0 units in steps of 0.2. Values cited 
in the text: 0.341 at 1.0, 0.385 at 1.2, 0.445 at 1.6, and 0.477 at 2.0.) 







(Horizontal axis is the number of S.E.M. units above or 
below the score.) 

Figure A-3 

Figure A-4 
(A normal curve with the middle marked and the area under it divided 
into sections A, B, C, and D. X = some number of S.E.M. units above 
the middle. Y = some number of S.E.M. units below the middle.) 

1. The table shows the proportion of area above or below the middle 
to some designated point (like X or Y) on the horizontal axis. Thus, 
you can read areas B and C directly. 

2. All of the area under the curved line and down to the straight 
horizontal line is designated as a proportion of 1.00. An area denoted 
by A, B, C, and D on the figure is obviously less than all of the area, 
so the proportion for each will be less than 1.00. The four areas taken 
together, though, make up almost all of the area. That is, the 
proportions in A + B + C + D equal almost 1.00. 

3. The curve is symmetrical. Thus, one-half of the area is to the 
positive side (right side) and the other half is on the negative side. In 
terms of proportion, 0.50 of the total area is to the right of the middle, 
and 0.50 is to the left. Thus, again, considering areas in terms of pro- 
portion of total area, A + B equals about 0.50 and C + D also equals 
about 0.50. 

4. The only values which can be read from the tables are the propor- 
tions between the middle and a score above or below the middle. That 
is, the proportion in B and C can be read directly from the table. 

5. How about the proportions of area in A and D? Is another table 
needed? Of course not. Any ninth-grade algebra student will tell you 
that if A + B = 0.50, then A = 0.50 - B. Since B can be read 
directly from the table, A can be computed. The same reasoning goes for 
area D. 

6. Now, the required link between proportion of area and probability 
of occurrence: an event which has no probability of occurrence is said 
to have probability zero (0.00). An event which is certain to occur is 
assigned a probability of 1.00. Events with probabilities between these 
two (no probability and certain probability) are assigned probability 
values between 0.00 and 1.00 (Example: The probability of heads 
turning up in the flip of a coin is 0.50 or one-half). All of which leads to 
this: The relationship between proportion of area and probability of 
occurrence is a direct one. The two phrases are used interchangeably. 
That is: 

a. If 0.30 of the area is in section A of figure A-4, then the probability 
of a score in that area is also 0.30. 

b. If 0.50 or half of the area is above the middle, then the probability 
of a score above the middle is also 0.50 or one-half. 

Remember back to the example being dealt with here — the girl has 
scored 120 on the test. By setting the obtained score (120) at the mid- 
dle of the normal distribution, the assumption is made that 0.50 of the 
area (or probability) is above 120, and the other half is below. The 
assumption seems reasonable. Based on a single observation, you really 
can't tell if the person's real performance level (based on repeated test- 
ings with different samples of items) will be above or below the single 
score. 
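
The proportions that table A-1 supplies can also be computed directly. The sketch below assumes the standard normal curve and uses Python's `math.erf`. The function names `area_from_middle` and `area_beyond` are illustrative, not from the text: the first corresponds to areas like B and C, the second to tail areas like A and D.

```python
import math


def area_from_middle(units: float) -> float:
    """Proportion of area between the middle and a point `units` S.E.M. away."""
    return 0.5 * math.erf(abs(units) / math.sqrt(2.0))


def area_beyond(units: float) -> float:
    """Proportion of area in the tail beyond `units` (0.50 minus the table value)."""
    return 0.5 - area_from_middle(units)


# Print a small table in steps of 0.2 units, from 0.2 to 3.0.
for tenths in range(2, 32, 2):
    units = tenths / 10
    print(f"{units:.1f}  {area_from_middle(units):.3f}  {area_beyond(units):.3f}")
```

Because the curve is symmetrical, the same function serves for points above and below the middle, which is why it takes the absolute value of `units`.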

Now for a series of examples directed at reading normal probabilities 
and interpreting the results: 

Example 1: On a single test, the girl scored 120. What is the prob- 
ability that her real performance level is more than 126? 



Solution: Draw a little picture as follows (figure A-5). 



a. Draw a rough normal distribution outline. Show the middle line. 
Divide the lower axis into three parts on each side of the middle. 




Figure A-5 

b. Label this horizontal axis two ways (figure A-6). The upper 
scale is always labelled with the same numbers of Standard Error of 
Measurement (S.E.M.) units. In theory, there is really no limit to the 
number of S.E.M. units above or below the middle that a score can be. 
Practically speaking, though, marking three units above and below as 
shown encompasses about 99.7 percent of the area. 

Also, mark a second scale directly below the S.E.M. scale. This is the 
scale of actual scores. Directly below the middle is the observed score — 
a score of 120, for example. The S.E.M. for the example is 6. One 
S.E.M. above 120 is 120 + 6 or 126. Thus, 126 is marked below the +1 
on the S.E.M. scale. One S.E.M. below 120 (at the -1 S.E.M. point) is 
marked 114 (since 120 - 6 = 114). The other points are 132 (for +2 
S.E.M.), 138 (for +3 S.E.M.), 108 (for -2 S.E.M.), and 102 (for -3 
S.E.M.). 

c. Next, shade in the area which is the focus of attention for this 
problem (figure A-7). The question: What is the probability of a score 
in excess of 126? Now you can see why the score scale is on your dia- 
gram. Draw a vertical line above 126 and shade in the area beyond (to 
the high side of) this line. Now the link between proportion of area and 
probability of occurrence becomes important. All you need to do is find 
the proportion of area in the shaded zone and the question is answered 
— since the proportion of area is the same as the probability of occur- 
rence. 

Figure A-6 
(Upper scale, S.E.M. units: -3, -2, -1, middle, +1, +2, +3. Lower 
scale, scores: 102, 108, 114, 120, 126, 132, 138.) 

Figure A-7 

Figure A-8 
(Area from the middle to +1 S.E.M., read from table: 0.341. Compute: 
0.500 - 0.341 = 0.159. Scales as in figure A-6.) 

d. The area is like a D area described earlier. You cannot, in other 
words, read it directly from the table. You can read directly the area 
from 120 (the middle) to 126, that is, the area from the middle to one 
S.E.M. above the middle. This value, from table A-1, is 0.341. Since half 
of the total amount of area (or 0.500) is above the middle, computing 
the shaded area requires only that you find what's left of the 0.500 after 
the 0.341 is taken away. This computation, as shown in figure A-8, 
leaves a remainder of 0.159. That is, 0.159 of the total amount of area 
is above +1 S.E.M., or in this problem, above a score of 126. In more 
practical terms, 0.159 rounds to about 0.16. For this problem, where the 
girl scored 120 on the test, it is assumed the chances are 16 in 100 that 
repeated testings would show her real performance level to be in excess 
of 126. How could this information apply to a real decision-making 
situation? 

Suppose ability groups were being established on the basis of these 
scores and that a score of 125 or above was required for assignment to 
the highest group. This girl, who scored 120, would be assigned to a 
lower group, but the probability is at least 1 in 6 that she has been 
misassigned. The probability that her real performance level is in 
excess of 126 is about 0.16, and the probability that it is at or above 
the cutoff of 125 is somewhat larger. 
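
Example 1 and the grouping decision can be checked with Python's standard library. This is a sketch under the appendix's own assumption, namely a normal distribution centered on the observed score with the S.E.M. as its spread. The variable names are illustrative.

```python
from statistics import NormalDist

observed_score = 120
sem = 6.0

# Assumed model: real performance level ~ Normal(observed score, S.E.M.).
real_level = NormalDist(mu=observed_score, sigma=sem)

# Probability that the real level exceeds 126 (one S.E.M. above the score).
p_above_126 = 1.0 - real_level.cdf(126)

# Probability that the real level exceeds the grouping cutoff of 125.
p_above_125 = 1.0 - real_level.cdf(125)

print(round(p_above_126, 3), round(p_above_125, 3))
```

The first value matches the 0.159 found from the table. The second is larger because the cutoff of 125 lies closer to the observed score than 126 does.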



Example 2: The standard error of measurement for a certain test is 
5.0. Elmer's score on the test is 30. What is the probability that his 
real score is between 24 and 36? 

Figure A-9 shows a composite of steps (a), (b), and (c) from the 
first example. The shaded area is the focus of the question. Table A-1, 
you will note, requires that the score be expressed in standard error of 
measurement units. Thus, you must change the scores to standard error 
units. The score 36 is 6 points above the middle (36 - 30 = 6). Since 
one standard error of measurement equals 5 units, the score 36 is 6/5 or 
1.2 standard error units above the middle. Likewise, the score 24 is 6 
points below the middle (since 24 - 30 = -6), and so the score 24 is 
1.2 S.E.M. units below the middle, or -1.2 on the S.E.M. scale. 

Figure A-9 
(Upper scale, S.E.M. units: -3, -2, -1, middle, +1, +2, +3. Lower 
scale, scores: 15, 20, 25, 30, 35, 40, 45. The area between 24 and 36 
is shaded.) 

Now, reading from the table, moving from the middle to +1.2 S.E.M. 
units will utilize 0.385 of the area under the curve. Likewise, moving 
from 1.2 S.E.M. units below the middle to the middle will utilize 
0.385 of the area. Thus, the total amount of shaded area is 0.385 + 
0.385 or 0.770. Since probability of occurrence and area included are 
used interchangeably, the answer to the problem is that the probability 
of a real performance level between 24 and 36 is 0.770, a little better 
than three chances in four. 
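
The same answer can be reached without converting to S.E.M. units by hand. The sketch below, assuming the same normal model, subtracts the area below the lower score from the area below the upper score. The helper name `probability_between` is illustrative.

```python
from statistics import NormalDist


def probability_between(observed: float, sem: float, low: float, high: float) -> float:
    """Probability that the real level lies between `low` and `high`."""
    dist = NormalDist(mu=observed, sigma=sem)
    return dist.cdf(high) - dist.cdf(low)


# Elmer: observed score 30, S.E.M. 5.0, interval from 24 to 36.
p = probability_between(30, 5.0, 24, 36)
print(round(p, 3))
```

The result agrees with the 0.770 obtained by adding the two table values.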



Example 3: The standard deviation of a test is 25 and the reliability 
is 0.84. 

a. What is the Standard Error of Measurement? 

b. Tom's score was 54. What is the probability his real performance 
level is below 38? 

Solution 

a. From chapter 9, 

$$
\text{S.E.M.} = \text{S.D.}\sqrt{1.00 - \text{REL}} = 25\sqrt{1.00 - 0.84} = 25\sqrt{0.16} = 10
$$

b. Figure A-10 shows the diagram you should draw. The score 38 is 
below the middle, and translates to -1.6 S.E.M. units: 

$$
\frac{38 - 54}{10} = -1.6
$$

From table A-1, the area between -1.6 and the 
middle is 0.445. Thus, the area below (to the left of) -1.6 is 0.500 - 
0.445 or 0.055. This is the area below 38, and is also the probability that 
his real performance level is below 38. 
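
Both parts of Example 3 can be chained in one short script. It is a sketch that assumes the chapter 9 formula for the S.E.M. and the standard normal curve for the probabilities. All variable names are illustrative.

```python
import math
from statistics import NormalDist

sd = 25
reliability = 0.84
observed_score = 54
target_score = 38

# Part (a): S.E.M. = S.D. * sqrt(1.00 - reliability).
sem = sd * math.sqrt(1.0 - reliability)

# Part (b): express the target score in S.E.M. units, then take the
# area to the left of that point on the standard normal curve.
units = (target_score - observed_score) / sem
p_below = NormalDist().cdf(units)

print(round(sem, 2), round(units, 2), round(p_below, 3))
```

Working in S.E.M. units mirrors the hand method, since the table is entered with that same number of units.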



Example 4: The standard deviation of a test is 20. The mean is 60. 
Only the top 12 percent are to be assigned to a special honors program. 
What score should be designated such that all at or above that score 
should be enrolled in the program? 







Figure A-10 



Solution 



a. First, assume the distribution of scores is approximately normal. 
Unless the test results are peculiar, this assumption will be a satisfac- 
tory one. 

b. Now draw a diagram like the one in figure A-11. Here the focus 
is a total distribution and not the concept of S.E.M. The middle is the 
mean score, and the standard deviation is used to mark the lower scale. 
The task is to find a score such that about 12 percent of the scores are 
at or above that point. Notice that a portion of the upper end is shaded. 





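Figure A-11 
(Upper scale, standard deviations: -3, -2, -1, middle, +1, +2, +3. 
Lower scale, scores: 0, 20, 40, 60, 80, 100, 120. The upper end of the 
distribution, beyond a point marked X, is shaded.) 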








The problem implies this upper extreme — "only the top 12 percent." 
You don't know yet exactly where to draw the vertical line — it is shown 
between 1 and 2 standard deviations above the middle, but that is just 
a guess. The task is to find out precisely where the upper region begins. 

c. Look at table A-1. Do some guessing. What if the line is drawn 
at +2.0 standard deviation units? How much of the area (and therefore 
probability) is in this upper region? From the table: 

1. At +2.0 s.d. above the middle, 0.477 is from 0 to 2, thus 0.023 is 
above. But 0.023 is considerably less than 12 percent (or, as a decimal, 
0.12) so the cutoff point must be closer to the middle. Try 1.6 s.d. units. 

2. At +1.6 s.d. above the middle, 0.445 is from 0 to 1.6, thus 0.055 
is above. Still, 0.055 is less than 0.12. Keep trying. Look at +1.2 and 
+1.0. 

3. At +1.2 s.d. above the middle, 0.385 is from 0 to 1.2, thus 0.115 
is above. 

4. At +1.0 s.d. above the middle, 0.341 is from 0 to 1.0, thus 0.159 
is above. 

The closest you can get to 0.12, using table A-1, is to use +1.2. 
Thus, a cutoff of +1.2 standard deviation units will give about 12 
percent of the people tested (actually, it will give 0.115 or 11.5 percent, 
but you're dealing with approximations here, and not precise figures). 

d. The final step is to change this number of standard deviation 
units into a score. The standard deviation is 20. Thus, 1.2 standard 
deviation units will be 1.2 × 20 or 24 units, but this is 24 units above 
the middle ("above the middle" because you are dealing with +1.2 
units). The middle (mean) was 60, so 24 units above the middle takes 
you to 60 + 24 or a score of 84. Thus, the answer to the problem is that 
the cutoff point is 84. 
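
The trial-and-error search in Example 4 can be imitated in code, and the standard library can also invert the normal curve directly. This sketch assumes the scores are approximately normal, as the solution does. The 0.2-unit grid stands in for a printed table, and all names are illustrative.

```python
from statistics import NormalDist

mean = 60
sd = 20
top_share = 0.12  # proportion to be admitted to the honors program

standard_normal = NormalDist()


def area_above(units: float) -> float:
    """Proportion of area above a point `units` standard deviations from the mean."""
    return 1.0 - standard_normal.cdf(units)


# Table-style search: try 0.0, 0.2, ..., 3.0 and keep the closest match.
candidates = [k / 5 for k in range(16)]
best_units = min(candidates, key=lambda u: abs(area_above(u) - top_share))
table_cutoff = mean + best_units * sd

# Direct inversion: the point with exactly 12 percent of the area above it.
exact_units = standard_normal.inv_cdf(1.0 - top_share)
exact_cutoff = mean + exact_units * sd

print(best_units, round(table_cutoff, 1), round(exact_cutoff, 1))
```

The grid search lands on +1.2 units and a cutoff of 84, as in the text. The direct inversion gives a cutoff slightly below 84, which reflects the approximation involved in reading a coarse table.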



Index 



Achievement test batteries: best 
time for testing, 160; elemen- 
tary school, 158-60; high school, 
160-61 
Acquired behavioral dispositions, 260-85 
Action-now evaluation, 302-3 
Affective measures, 260-85 
Affective objectives, taxonomy, 38 
Alternate-form reliability, 188-91 
Aptitude: defined, 76; vs. achieve- 
ment, 152-53 
Attitude and behavior, 261-63 
Attitude change measurement, 263-67; desensitization, 264 
Attitude measures, 260-85; and 
baseline data, 278-79; equal-ap- 
pearing intervals, 275-77; Lik- 
ert-type scales, 267-74 
Attitude measures, standardized, 
172-75; cautionary notes, 174-75 
Auditory discrimination, 154 

Beggs, Donald L., 164 
Behavioral objectives, 17-45; and 
conditions of performance, 25-29; 
and evaluation, 18-19; and 
instructional strategy, 22; and 
specificity, 18; and terminal 
behavior, 24-25; cautionary notes, 
54-59; comprehensiveness and 
acceptability of objectives, 62-70; 
depositories of, 69; English 
literature example, 42-43; in 
support of, 43-44; objections to, 
46-60; political science example, 
40-42; teacher help needed, 53, 
68-69; need for, 12-15 

Behavior, defined, 24 

Bloom, Benjamin, 36, 77 

Buros, O. K., 147 

Campbell, Donald T., 260 

Category matching, 110-11 

Change scores, 190-91 

Classroom questioning formats, 
92-127 

Classroom tests, 61-140; and ex- 
pectations for students, 83-87; 
example from elementary-school 
science, 132-34; example from 
English, 128-32; project, 135-40; 






assembling hints, 101; biased 
item sampling, 70-72; compre- 
hensiveness and acceptance of 
objectives, 62-70; comprehen- 
siveness of the measure, 70-76; 
conceptual scheme, 61-91; gen- 
eral considerations, 99-102; 
number of items, 99; random 
item sampling, 70-74; reporting 
the results, 87-88; test adminis- 
tration decisions, 76-80 
Cognitive levels, 36-67 
Completion questions, 108 
Comprehensive Test of Basic Skills, 254 
Computation question, 107-8 
Concurrent validity, 204 
Conditions of performance, 25-29; 
and reasonableness, 27-28; jus- 
tification of, 28 
Construct validity, 201, 209-12 
Content validity, 200, 201-4 
Continuous assessment, 296-301 
Corrected reliability coefficient, 194-96 
Corrected true-false item, 117-18 
Correlation coefficient, 181-85; 
product-moment contribution, 
182 
Criterion keying, 169; and dis- 
crimination, 169 
Criterion measure, 30-34 
Criterion-referenced tests: and re- 
liability, 194 
Criterion-related validity, 204-9 
Criticisms of behavioral objec- 
tives, 46-60; and fairness issue, 
46-52; and practicality, 52-54; 
cautionary notes, 54-59 

Diagnostic tests, 153-58; mastery 
tests, 156; problem identifica- 
tion, 157 

Differential Aptitude Test, 166 



Ebel, Robert L., 112 
Equal-appearing intervals, 275-77 
Essay-type items, 102-6; scoring, 104-6 
Evaluation and assessment, 37 
Evaluation objectives, 290-91 
Evaluation sequence, 76-80 
Evaluation training, 303-4 
Expectancy tables, 213-19; and 
decisions, 216-17; with inexpe- 
rienced students, 217-18 

Grade equivalent scores, 249-56; 
cautionary notes, 252-56 

Hieronymus, A. N., 158 

Individualized instruction, 82 
In-level testing, 80-81, 255-56 
Intelligence tests. See Survey scholastic aptitude tests 
Internal consistency, 173, 181 
Interest measures, 277-78; by oc- 
cupational choice, 277-78 
Iowa Tests of Basic Skills, 158 
Iowa Test of Educational Devel- 
opment, 161 
Ipsative measure, 173 
Item format in standardized tests, 175-78 
Item types, 92-127 

Krathwohl, D. R., 38 

Kuder Preference Records, 168-71 

Kuder-Richardson reliability, 187-88 



Likert-type measures, 173, 267-74 

Marking systems, 256-59; and con- 
tracts, 256-57; and individual- 
ized instruction, 256 

Matching items, 109-10; category 
matching, 110-11 






Maximum performance tests, 149- 
68; and lack of experience, 150- 
51 

Maximum vs. typical performance 
measures, 149-50 

Measurement vs. evaluation, 286 

Mental Measurements Yearbooks, 
147 

Minimum acceptance level, 30-34; 
and reasonableness, 33 

Multiple-choice items, 120-25; and 
numerical problems, 122; in 
published tests, 124; in teacher- 
made tests, 123-24 

Negative discriminators, 187 
Non-supply questions, 92, 108-25 

Opportunistic learning, 55 
Over-defined constructs, 175 

Pearson product-moment correlation 
coefficient, 181-85 
Percentile ranks and norms, 234- 
42; cautionary notes for use, 
239-42; quick computing meth- 
od, 237 
Performance measures, 93-96 
Personality construct, 172 
Personality measures, 172-75; cau- 
tionary notes, 174-75; construc- 
tion by content validity, 17 
empirical keying method, 173 
over-defined constructions, 175 
situation specific measures, 174- 
75 
Predictive validity, 204 
Project evaluation, 286-304; long- 
term effects, 287-89 
Psychomotor objectives, 38-39 
Published tests, differences from 
teacher-made, 144-46 

Reactive measures, 96 
Reading readiness tests, 154-55 



Readiness tests, 153-56 

Reliability, 180-98; alternate 
forms, 188-91; and criterion- 
referenced tests, 194; and valid- 
ity, 180; comparing techniques, 
191-92; correction for limited 
range of scores, 194-96; in- 
terpretation cautions, 192-93; 
Kuder - Richardson methods, 
187-88; split-halves, 180-86; test- 
retest, 188-91 

Scholastic aptitude tests, 162-66; 
and environment, 165; and ho- 
mogeneous grouping, 164; cau- 
tions for users, 165; factors not 
included in, 162-63; group vs. 
individualized, 162-63 

Score reporting, 231-59; grade 
equivalents, 249-56; percentile 
ranks, 234-42; raw scores, 232- 
34; standard scores, 242-56; 
stanine scores, 249; transforma- 
tions, 242-47; z-scores, 242 

Self-fulfilling prophecy, 90 

Serendipitous learning, 55 

Short-answer completion items, 
107 

Situation specific measures, 174-75 

Sociogram, 279-84 

Spearman-Brown prophecy for- 
mula, 184 

Special education tests, 157-58 

Specimen set, 148 

Split-halves reliability, 180-86; 
and internal consistency, 180 

Standard deviation computations, 
183 

Standard error of measurement, 
219-25; formula, 220; in inter- 
preting IQ scores, 224 

Standardization program, 142-44; 
problems with, 143 

Standardized tests, 141-79; and 






unstandardized tests, 144-46; 
diagnostic uses, 153-58; infor- 
mation about, 149-50; item for- 
mats, 175-78; needs and avail- 
ability, 148-49; survey outline 
of, 148-49 

Standard scores and norms, 242-56 

Stanine scores, 249 

Strong Vocational Interest Blank, 
168-71 

Supply-type items, 92, 102-8 

Survey achievement tests, 158-61 

Survey scholastic aptitude tests, 
162-66 

Survey vocational batteries, 166- 
67 

Taxonomy of Educational Objec- 
tives, 36-37 
Test of Academic Progress, 161 
Test-retest reliability, 188-91 
Test sensitization, 190 
Test vs. measure, 149 
Thurstone, L. L., 275 
Timing of testing, 76-80 



True-false items, 112-20; at all 
cognitive levels, an example, 
118-20; corrected true-false 
items, 117-18 

Typical performance tests: and 
problem of "faking," 151-52 

Unobtrusive measures, 96-98 

Validity, 199-230; and range of 
scores, 209; cautionary notes, 
226-28; concurrent, 204; con- 
struct, 201, 209-12; content, 200, 
201-4; criterion-related, 201, 
204-9; predictive, 204 
Visual discrimination, 154 
Vocational interest measures, 168- 
71; cautionary notes in interpre- 
tation, 170 

Webb, Eugene J., 96 

Z-scores, 242 
Zwicker, B. L., 266 



Charles E. Merrill Publishing Company 

A Bell & Howell Company 

Columbus, Ohio 






